QUANTITATIVE PREDICTION METHOD

The present invention concerns methods and systems for analysis of drug resistance in
HIV-1. More specifically, the invention provides methods for predicting drug
resistance by correlating genotypic information with phenotypic profiles. The methods
allow the identification of primary and secondary resistance-associated mutations for
new and existing drugs and for calculating the contribution of mutations and
combinations of mutations to resistance and hyper-susceptibility. The invention allows
the design, optimization and assessment of the efficiency of a therapeutic regimen
based upon the genotype of the disease affecting a patient.

This application claims priority benefit of EP patent application nr. 03101687.6, and of
U.S. Provisional Application No. 60/478,780 filed on June 16, 2003, the contents of
which are expressly incorporated by reference herein. All other publications, patents
and patent applications cited herein are incorporated in full by reference.

BACKGROUND

Techniques to determine the resistance of HIV-1 to a therapeutic agent are becoming
increasingly important. Many patients experience treatment failure or reduced efficacy
over time. This is generally due to the virus mutating and/or developing a resistance to
the treatment. As used herein, "HIV" is the human immunodeficiency virus, which is a
retrovirus.

The various different anti-HIV-1 agents that have been developed over the years were
initially administered to patients alone, as monotherapy. Though a temporary antiviral
effect was observed, all the compounds lost their effectiveness over time. Research has
now demonstrated that one of the main reasons behind treatment failure for all the
antiviral drugs is the development of resistance of the virus to the drug (see, for
example, Larder et al., 1989, Science, 246, 1155-8). This is largely due to the ability of
HIV continuously to generate a number of genetic variants in a replicating viral
population. These genetic changes generally alter the configuration of the HIV reverse
transcriptase (RT) and protease (PR) molecules in such a way that they are no longer
susceptible to inhibition by compounds developed to target them. If antiretroviral
therapy is ongoing and if viral replication is not completely suppressed, the selection of
genetic variants is inevitable and the viral population becomes resistant to the drug.
Since then, dual combination therapy, using drugs that target both HIV reverse
transcriptase (RT) and protease (PR) molecules, has provided increased control of viral
replication, and thus provided extended clinical benefit to patients. In recent years,
however, it has become clear that even patients being treated with triple therapy
including a protease inhibitor often eventually experience treatment failure.

Since patients in the developed world are generally prescribed cocktails of therapeutic
drugs, not all HIV-1 infections originate with wild-type, drug-sensitive
strains, from which drug resistance inevitably emerges. As such, with the increase in
prevalence of drug resistant strains, there comes an increase in infections that actually
begin with drug resistant strains. Infections with pre-existing drug resistance
immediately reduce the options for drug treatment and emphasize the importance
of drug resistance information to optimize initial therapy for these patients.

Moreover, as the number of available antiretroviral agents has increased, so has the
number of possible drug combinations and combination therapies. It is therefore very
difficult, if not impossible, for the physician to establish the optimal combination for an
individual. Although there are many drugs available for use in combination therapy,
the choices can quickly be exhausted and the patient can rapidly experience clinical
progression or deterioration if the wrong treatment decisions are made. The key to
tailored, individualized therapy lies in the effective profiling of the individual patient's
virus population in terms of sensitivity or resistance to the available drugs. This
requires the advent of truly individualized therapy.

There are certain solutions to this problem currently in use.

Phenotyping directly measures the actual sensitivity of a patient's pathogen or
malignant cell to particular therapeutic agents. However, this can be slow,
labor-intensive and thus expensive.

A second approach to measuring resistance involves genotyping tests that detect
specific genetic changes (mutations) in the viral genome which lead to amino acid
changes in at least one of the viral proteins, known or suspected to be associated with
resistance. Although genotyping tests can be performed more rapidly, a problem with
genotyping is that there are now over 100 individual mutations with evidence of an
effect on susceptibility to HIV-1 drugs and new ones are constantly being discovered,
in parallel with the development of new drugs and treatment strategies. The
relationship between these point mutations, deletions and insertions and the actual
susceptibility of the virus to drug therapy is extremely complex and interactive. An
example of this complexity is the M184V mutation that confers resistance to 3TC but
reverses AZT resistance. The 333D/E mutation, however, reverses this effect and can
lead to dual AZT/3TC resistance.

Sophisticated interpretation is therefore required to predict what the net effect of these
mutations might be on the susceptibility of the virus population to the various
therapeutic agents. Custom algorithms such as rules-based computer algorithms have
provided some assistance, for example, see International patent application
WO01/79540. An overview of this type of technique is presented in Figure 1.
Beerenwinkel et al., PNAS (Jun 2002), 99(12), pp 8271-8276; Schmidt et al., AIDS
(Aug 2000), 14(12), pp 1731-1738; and Sevin et al., Journal of Infectious Diseases (Jul
2000), 182(1), pp 59-67; disclose methods for quantitating the individual contribution
of a mutation or combination of mutations to the drug resistance phenotype exhibited
by HIV based on different algorithms such as, respectively, decision trees, a rules-based
approach and statistical analyses such as cluster analysis, recursive partitioning and linear
discriminant analysis. In Schmidt et al., AIDS Reviews, 4(3), pp 148-156, further
methods are reviewed.

Meisel et al., Therapeutic Drug Monitoring (Feb 2001), 23(1), pp 9-14; and Meisel et
al., Pharmacogenetics (1997), 7(3), pp 241-246; disclose a method for predicting the
metabolic activity phenotype from the mutation pattern of the NAT-2 gene by multiple
linear regression analysis. The linear regression model describes a quantity Rs, the
metabolic ratio, built up by an error term (first term) and a sum of products built up
from a mutation factor multiplied by a mutation-dependent resistance coefficient.
However, given the nature of the NAT-2 genotypic patterns, the above methods do not
consider the relationship between point mutations within a genotypic pattern. In
particular, the quantitative prediction methods proposed are merely an addition of
independent variables where effects such as antagonism or synergy between point
mutations, insertions or deletions are not taken into account.

There remains a continuing need for the quantitative prediction of HIV drug
susceptibility from viral genotype. In particular, there is a need for quantitative
prediction methodologies like linear regression modelling which can grasp the
complexity of the HIV-1 genotypic-to-phenotypic dynamics, i.e. combinatorial effects
such as antagonism and synergism. Furthermore, because the majority of HIV patients
have now been exposed to drug cocktails, it is thought that the disease-causing
retroviruses tend to spontaneously generate mutations that have often co-evolved. This
makes the analysis of which mutations are responsible for resistance to which drugs
almost impossible using currently available techniques. It also means that mutations
that contribute to resistance are being overlooked using the currently available analysis
techniques.

It is therefore an aim of the present invention to provide methods for improving the
interpretation of genotypic results.

It is a further aim of the invention to provide methods for determining (or predicting) a
phenotype based on a genotype.

It is also a further aim of the invention to provide methods for predicting the resistance
of an HIV variant of a particular genotype to a therapy or a therapeutic agent.

It is also an aim of the invention to predict resistance of a patient to therapy.

It is also an aim of the invention to provide methods to assess the effectiveness or
efficiency of a therapy or to optimize a patient's therapy.

It is also an aim of the invention to identify novel HIV-1 mutations that are associated
with resistance to particular drug therapies or combination therapies.

SUMMARY OF THE INVENTION 

A solution to these problems involves new methods for measuring drug resistance by
correlating genotypic information with phenotypic drug resistance profiles measured
experimentally.

According to a first aspect of the invention, there is provided a method for quantitating
the individual contribution of a mutation or combination of mutations to the drug
resistance phenotype exhibited by HIV, said method comprising the steps of:
a) performing a linear regression analysis using data from a dataset of matching
genotypes and phenotypes, whereby the log fold resistance, pFR, is modelled as the
sum of all the individual resistance contributions for each of the mutations or
combinations of mutations that occur in HIV according to the following equation:

$$\mathrm{pFR} = \beta_A M_A + \beta_B M_B + \dots + \beta_Z M_Z + e$$

wherein each individual resistance contribution is calculated by multiplying a mutation
factor, $M_A, M_B, \dots, M_Z$, for each mutation or combination of mutations by a
resistance coefficient, $\beta_A, \beta_B, \dots, \beta_Z$;

wherein the mutation factor assigned to each mutation or combination of mutations
reflects the degree to which that mutation or combination of mutations is present in the
HIV strain and, if present, to which degree the mutation is present in a mixture;
wherein each resistance coefficient reflects the contribution of the mutation or
combination of mutations to the fold resistance exhibited by the strain;

and wherein the error term, $e$, represents the difference between the modelled prediction
and the experimentally determined measurement.

This method involves a data driven technique for quantitative drug susceptibility
prediction. This method uses a multiple linear regression model to estimate coefficient
values that accurately reflect the contribution made by a particular HIV mutation or
combination of mutations to resistance to a particular drug. Repeating the method for
each candidate therapeutic drug allows the compilation of a global picture of drug
resistance exhibited by a particular HIV strain.

This method has allowed the identification of mutations hitherto unrecognized as
having an effect on drug resistance in HIV. The method also allows the identification
of primary (single mutations) and secondary (the co-occurrence of two mutations) or
higher order terms resistance-associated mutations for new and existing drugs, which
embrace the antagonistic and synergistic phenomena. Accordingly, a further aspect of
the invention provides a method of identifying a mutation that affects the degree of
drug resistance exhibited by an HIV strain using a method according to the first aspect
of the invention.

The method of the first aspect of the invention is also advantageous over current
methods since it allows the quantitative, purely data-driven, objective assessment of the
contribution of mutations and combinations of mutations to drug resistance. The
method also allows the de-convolution of the individual contribution made by
particular mutations to the drug resistance phenotype. Unlike existing methods, the
method is able to correct for correlating mutations that on the face of it appear to affect
drug resistance, but which in fact only correlate in their occurrence with
resistance-causing mutations and are themselves phenotypically silent.

The method has allowed the design of an automated computational technique for the
prediction of the drug resistance profile possessed by a particular HIV strain infecting a
patient. The methods thus allow the determination of a patient phenotype without
having to perform any phenotypic testing whatsoever. This has clear ramifications for
the bespoke design, optimisation and assessment of strategies for individual patient
therapy based upon the genotype of the infecting agent.

The invention also provides diagnostic kits for performing each of the methods of the 
invention described herein. 

DESCRIPTION

In any population of HIV variants, there is a wide distribution of drug resistance
phenotypes for any particular drug, ranging from hyper-susceptibility to strong
resistance (see Figure 2). The expression "drug resistance phenotype" means the
resistance of an HIV virus to a tested therapy, therapeutic agent or drug. The term
"resistance" as used herein, pertains to the capacity of resistance, sensitivity,
susceptibility, hyper-susceptibility or effectiveness of a therapy against a disease. The
term "therapy" includes but is not limited to a drug, pharmaceutical, or any other
compound or combination of compounds that can be used in therapy or therapeutic
treatment of HIV. This distribution of drug resistance reflects the large number of
different genotypes that are present in the population. Some variants may only have
one mutation that is correlated with drug resistance, whilst others will have several or
numerous such mutations, each of which may impart its own contribution to the drug
resistance phenotype.

Adding an additional level of complication are the phenomena of antagonism, synergy
and enhancement, where certain mutations may add to or detract from the effect of
other mutations in a manner not predictable from studying the effects of the individual
mutations alone. Highly correlated mutations are also problematic. These are
mutations that almost always co-occur in a strain, but only one of the mutations
actually has an effect on drug resistance. For example, when one of these 2 mutations
has an effect on resistance and the other mutation does not (this mutation might for
example be highly correlated with the resistance mutation because it affects the
replication rate of the virus), the effect can erroneously be assigned to either one of the
mutations.

Examples of mutations known or suspected to influence the sensitivity of HIV to drug
therapy may be found on the internet at http://hiv-web.lanl.gov;
http://hivdb.stanford.edu/hiv/; or http://www.viral-resistance.com.

In HIV, two sections of the genome are generally studied: Protease (PR) and Reverse
Transcriptase (RT). The methods of the present invention can equally be applied to
other sections of the HIV genome such as Integrase (IN). A mutation is presented as a
number referring to the position in the protein, followed by the amino acid(s) on that
position, if it differs from the amino acid in the HXB2 HIV reference. In the terms
included above, the mutations are represented as "A", "B", ..., "Z".

Mixtures reflect the diversity of the HIV population in a sample. It means that on that
position two subsets of the population have a different amino acid. Mixtures are
denoted by separating amino acids with the character '/': 65K/R (mixture of 'K' and
'R' at position 65).

When more than two amino acids are found on a certain position in subsets of the
population, the dummy amino acid 'X' is used.

Insertions are denoted by adding the insert position behind a dot: 69.2S (an insert of 'S'
at insert position 2). Deletions are denoted by a minus sign: 69-.
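
By way of illustration only, the notation described above can be read mechanically. The following Python sketch is not part of the described method; the names `Mutation` and `parse_mutation` and the regular expression are assumptions introduced here to show how substitutions, mixtures, insertions and deletions written in this notation might be decoded.

```python
import re
from typing import NamedTuple


class Mutation(NamedTuple):
    position: int       # amino acid position in the protein
    insert_index: int   # 0 for a substitution or deletion, k for insert position k
    residues: tuple     # amino acids reported at the position; () for a deletion
    is_mixture: bool    # True for a '/'-separated mixture or the dummy residue 'X'


# position, optional ".insert position", then residues (optionally '/'-separated) or '-'
_PATTERN = re.compile(r"^(\d+)(?:\.(\d+))?([A-Z](?:/[A-Z])*|-)$")


def parse_mutation(text: str) -> Mutation:
    """Decode one mutation string such as '65K/R', '69.2S' or '69-'."""
    match = _PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognised mutation: {text!r}")
    position, insert_index, residues = match.groups()
    amino_acids = () if residues == "-" else tuple(residues.split("/"))
    is_mixture = len(amino_acids) > 1 or amino_acids == ("X",)
    return Mutation(int(position), int(insert_index or 0), amino_acids, is_mixture)


for example in ("65K/R", "69.2S", "69-", "84V"):
    print(example, parse_mutation(example))
```

The reference (HXB2) amino acid is not part of the notation, so a sketch of this kind cannot tell whether a mixture includes wild type; that would require the reference sequence.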

Examples of mutations present in the RT domain of HIV conferring resistance to a
reverse transcriptase inhibitor include 69C, 69V, 69T, 75A, 101I, 103T, 103N, 184T,
188H, 190E, 219N, 219Q, 221Y, 221I, and 233V. Additional examples of mutations
present in the protease (PR) domain of HIV conferring resistance to a reverse
transcriptase inhibitor include 24H48A, and 53L. A mutation may affect resistance
alone or in combination with other mutations.

For the purposes of the invention, the mutations identified should be associated with
resistance or susceptibility to drug therapy, for example an antiretroviral drug. The
degree to which a particular mutation pattern may affect resistance may be determined
by a resistance monitoring assay such as the ANTIVIROGRAM® (Virco, Belgium) (see
WO97/27480). In this assay, the patient sample is compared with a reference viral strain,
HIV LAI/IIIB. The difference in IC50 (the concentration of drug required to reduce the
virus growth in cell culture by 50%) between the patient sample and the reference viral
strain is determined as a quotient. This fold change in IC50 is reported and is indicative of
the resistance profile of a certain drug. Based on the changes in IC50, cutoff values
have been established to classify a sample as sensitive or resistant to a
certain drug.

Various projects are underway to compile data relating to the correspondence of certain
mutations with drug resistance phenotype, and these generally lead to the generation of
relational databases of tables that illustrate the matching genotype / resistance
phenotype for various antiretroviral drugs. Such databases bring together the
knowledge of both a genotypic and phenotypic database. The phenotypic database
contains phenotypic resistance values for HIV to at least one therapy, preferably
multiple drug therapies. Examples are the phenotypic resistance values of tested HIV
viruses, with a fold resistance determination compared to the reference HIV virus (wild
type).

The dataset used herein is a dataset developed by the Applicant, which consists of a set
of matching genotype / phenotype measurements with possible multiple phenotype
measurements per genotype. However, any similar dataset may be used, provided that
there are sufficient entries for each genotype / phenotype measurement for the data to
be significant. In the Virco dataset, the mutations are defined relative to HXB2 at
amino acid level.

The phenotypes are presented as pFR values, where pFR is equal to -log(FR), where
FR denotes the Fold Resistance. Negative pFR values thus denote resistance;
positive values denote hyper-susceptibility. For example, a pFR value of -1.0 is equal
to 10-fold resistance. An example of the pFR distribution for Saquinavir (SQV) is
shown in Figure 3. Figure 4 shows the pFR distribution for the "48V" mutation on
SQV. It is clear from this that the 48V subset does not behave the same as the whole
dataset.
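
As a brief illustration of the pFR scale defined above, the following Python sketch converts between IC50 values, fold resistance and pFR. It assumes a base-10 logarithm, which is what the stated example (a pFR of -1.0 equalling 10-fold resistance) implies; the function names and the IC50 figures are hypothetical.

```python
import math


def fold_resistance(ic50_sample: float, ic50_reference: float) -> float:
    """Fold resistance as the quotient of sample IC50 over reference IC50."""
    return ic50_sample / ic50_reference


def p_fr(fold: float) -> float:
    """pFR = -log10(FR): negative values denote resistance."""
    return -math.log10(fold)


def fold_from_p_fr(value: float) -> float:
    """Inverse transformation, FR = 10 ** (-pFR)."""
    return 10.0 ** (-value)


# Hypothetical IC50 values (same units for sample and reference).
fr = fold_resistance(ic50_sample=5.0, ic50_reference=0.5)
print(fr, p_fr(fr))          # a 10-fold resistant sample has pFR = -1.0
print(fold_from_p_fr(0.5))   # a positive pFR corresponds to hyper-susceptibility (FR < 1)
```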

The problems of unwanted correlations between mutations where not all correlated
mutations contribute to the drug resistance phenotype are illustrated in Figure 6. Here,
the left hand panel shows the pFR distribution for the 71I mutation. When the effects
of mutations 48V and 84V are removed (right hand panel), the pFR in the distribution
of variants is markedly increased (less drug resistance).

According to the invention, the predicted fold resistance of an HIV strain of a particular
genotype may be calculated by summing the individual resistance contributions for
each of the mutations or combinations of mutations in the mutation pattern of that
genotype. The method uses linear regression models, so that the phenotype prediction,
pFR, is calculated in the following equation (1):

$$\mathrm{pFR} = \beta_A M_A + \beta_B M_B + \dots + \beta_Z M_Z + e \qquad (1)$$

The independent variables $M_A, M_B, \dots, M_Z$ are referred to herein as mutation
factors, each of which reflects the degree to which the mutation or combination of
mutations is present in the HIV strain and, if present, whether or not the mutation is
present in a mixture.

The resistance coefficients $\beta_A, \beta_B, \dots, \beta_Z$ represent the contribution to the total pFR
prediction for each single mutation.

Each mutation factor $M_i$ thus represents the presence or absence of the corresponding
mutation and each coefficient $\beta_i$ represents the contribution to the pFR change for that
specific mutation.

The mutation factor may take into account 1st order terms (single mutations) as well as
2nd order terms (the co-occurrence of two mutations) and in general nth order terms.
For example, 2nd order terms take the form:

$$\beta_{AB} M_{AB}$$

The independent variable $M_{AB}$ represents the co-occurrence of mutations A and B and
the coefficient $\beta_{AB}$ represents the synergy or antagonism between mutations A and B.
When the mutation factor embraces nth order terms, thus the co-occurrence of two or
more mutations, the terms take the form:

$$\beta_n M_n$$

wherein $M_n$ represents the co-occurrence of one mutation with other one or more
mutations, such as duplets, triplets, quadruples, etc., and the coefficient $\beta_n$ represents
the synergy or antagonism between the one mutation with the other one or more
mutations; thus n will apply for instance to the combinations of mutations AB, ABC,
ABCD, BC, ACD, etc.

As such, the linear regression model may take the following equation:

$$\mathrm{pFR} = \beta_A M_A + \beta_B M_B + \beta_{AB} M_{AB} + \dots + \beta_n M_n + e$$

Higher order terms account for interactions between mutations:

• antagonism or reversal: positive pFR shifts for mutation couples.

• synergy or enhancement: extra negative pFR shift for mutation couple.



Mutation      Coefficient
84V           -0.46
50V           -0.92
54M           -0.64
88S            0.63
90M           -0.16
46L           -0.19
46L&84V       -0.09



Consider a virus with following mutations: 3I, 46L, 84V and 90M. Applying equation
(1), this virus will have a pFR prediction:

$$\mathrm{pFR} = \beta_{84V} \cdot 1 + \beta_{46L} \cdot 1 + \beta_{90M} \cdot 1 + \beta_{46L\&84V} \cdot 1 = -0.9$$

or almost 8-fold resistance, i.e. pFR = -log FR.

Note that in the model 46L and 84V are synergistic since their co-occurrence decreases the
pFR by an extra -0.09.
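
The arithmetic of this worked example can be checked with a few lines of Python. This is a sketch only: the dictionary layout and the name `predict_p_fr` are assumptions made here, the coefficients are those tabulated above, every mutation factor is taken as 1 (no mixtures), and a base-10 logarithm is assumed.

```python
# Coefficients from the table above; each key is the set of mutations in the term.
COEFFICIENTS = {
    frozenset({"84V"}): -0.46,
    frozenset({"50V"}): -0.92,
    frozenset({"54M"}): -0.64,
    frozenset({"88S"}): 0.63,
    frozenset({"90M"}): -0.16,
    frozenset({"46L"}): -0.19,
    frozenset({"46L", "84V"}): -0.09,  # 2nd order (interaction) term
}


def predict_p_fr(genotype, coefficients=COEFFICIENTS):
    """Sum the coefficients of every term whose mutations are all present."""
    present = set(genotype)
    return sum(beta for term, beta in coefficients.items() if term <= present)


# The position-3 mutation has no coefficient in the table, so it contributes nothing.
genotype = ["3I", "46L", "84V", "90M"]
p_fr = predict_p_fr(genotype)
fold = 10 ** (-p_fr)
print(round(p_fr, 2), round(fold, 2))
```

The four contributing terms are -0.46, -0.19, -0.16 and -0.09, giving a pFR of -0.90 and a fold resistance just under 8.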

The error term is e, which is the difference between the prediction and the
measurement. This error term contains both the measurement error on the phenotype
measurement and a model error (if the underlying model has higher order terms that
are not taken into account in the regression model).

Mutation factors for single mutations ($M_A, M_B, \dots, M_Z$) are calculated as follows:
if the mutation is present in the HIV strain, a positive mutation factor is assigned;



WO 2004/111907



PCT/EP2004/051084 






if the single mutation is not present, the mutation factor assigned is zero;

if the single mutation is present in a mixture, an averaged positive mutation factor is
assigned.

Conveniently, mutation factors range between 0 and 1 where 0 means not present and 1
means present. Values between 0 and 1 mean that the mutation is present in a
mixture. Accordingly, a positive mutation factor is assigned the value 1.

Mixtures are modelled as causing the average shift of its constituent mutations. Since
methods for the quantitation of the precise proportions of mixtures to wild type are
expensive and time-consuming, a mixture with wild type may conveniently be treated as
causing half the pFR shift of the resistance mutation (mutation factor = 0.5). However,
as the skilled reader will appreciate, a more precise mutation factor may be assigned if
the true proportion in the mixture is known.

Mutation factors for double mutations ($M_{AB}$ etc.) are calculated as follows:

if both the mutations are present in the HIV strain, a positive mutation factor is
assigned (conveniently, the value 1);

if neither of the mutations is present, the mutation factor assigned is zero;

if both mutations are present and one mutation is present in a mixture, an averaged
positive mutation factor is assigned (conveniently, 0.5);

if both mutations are present in a mixture, a reduced averaged positive mutation factor
is assigned (in this example, 0.25). The factor 0.25 is the product of the M-factors of
both the single constituent mutations. This is the result of the assumption that these
mixtures are independent of each other. Of course, this is an approximation, since in a
real blood sample, the mixtures are not independent of each other. For example, if only
2 viruses were present, virus A (no mutations) for 70% and virus B (mutations 46I and
84V) for 30%, then a mixture would be detected on both positions 46 and 84. If these
concentrations were known, it would be possible to fine tune the mutation factor of
0.25. If this information is not available, the best statistical guess is 0.5*0.5; this being
the average value that would be measured for the mutation couple being present for a
population of samples that have these mixtures on 46 and 84 in all possible
concentrations.
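
The rules for single and double mutation factors can be summarised in a short Python sketch. It is illustrative only: the function names are assumptions, the default proportion of 0.5 is the convenient value given above for a mixture of unknown composition, and the combination factor is taken as the product of the single factors under the stated independence assumption.

```python
from math import prod


def single_factor(present: bool, in_mixture: bool = False, proportion: float = 0.5) -> float:
    """0 if absent, 1 if present, otherwise the (assumed or known) mixture proportion."""
    if not present:
        return 0.0
    return proportion if in_mixture else 1.0


def combination_factor(single_factors) -> float:
    """Factor for a co-occurrence term: product of its single mutation factors."""
    return prod(single_factors)


m_46 = single_factor(present=True, in_mixture=True)   # 0.5
m_84 = single_factor(present=True, in_mixture=True)   # 0.5
print(combination_factor([1.0, 1.0]))    # both present, no mixture
print(combination_factor([1.0, m_84]))   # one of the two in a mixture
print(combination_factor([m_46, m_84]))  # both in a mixture
print(combination_factor([1.0, 0.0]))    # one mutation absent
```

Taking the product also makes the factor zero whenever any constituent mutation is absent, so the co-occurrence term only contributes when the whole combination is detected.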

Similarly, the mutation factors for triple mutations ($M_{ABC}$ etc.) shall be calculated as
follows:

if the three mutations are present in the HIV strain, a positive mutation factor is
assigned (conveniently, the value 1);

if none of the three mutations is present, the mutation factor assigned is zero;

if the three mutations are present and one mutation is present in a mixture, an averaged
positive mutation factor is assigned (conveniently, 0.5);

if the three mutations are present and two mutations are present in a mixture, a reduced
averaged positive mutation factor is assigned (in this example, 0.25). The factor 0.25 is
the product of the M-factors of the two single constituent mutations present in the
mixture;

if the three mutations are present in a mixture, a reduced averaged positive mutation
factor is assigned (in this example, 0.125). The factor 0.125 is the product of the
M-factors of the three single constituent mutations.

The calculation of the mutation factors for higher order terms shall take the same
principle.

Calculation of the resistance coefficients ($\beta_A, \beta_B, \dots, \beta_Z, \beta_{AB}$) is performed by
evaluating the dataset for the drug phenotype reported for each mutation or
combination of mutations.
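
One way such coefficients could be estimated from a table of mutation factors and measured pFR values is ordinary least squares. The Python sketch below is an illustration under stated assumptions, not the procedure actually used: it fits equation (1) without an intercept by solving the normal equations, the function name `fit_coefficients` is hypothetical, and the five "measurements" are noise-free values constructed from the coefficients for 46L, 84V and 46L&84V tabulated earlier.

```python
def fit_coefficients(rows, targets):
    """Ordinary least squares without intercept: minimise sum((y - sum_k b_k * M_k)^2)."""
    n_terms = len(rows[0])
    # Normal equations (X^T X) b = X^T y, stored as an augmented matrix.
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(n_terms)]
        + [sum(r[i] * y for r, y in zip(rows, targets))]
        for i in range(n_terms)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n_terms):
        pivot = max(range(col, n_terms), key=lambda k: abs(aug[k][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("terms are collinear; coefficients are not identifiable")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(n_terms):
            if row != col:
                ratio = aug[row][col] / aug[col][col]
                aug[row] = [a - ratio * b for a, b in zip(aug[row], aug[col])]
    return [aug[i][n_terms] / aug[i][i] for i in range(n_terms)]


# Columns: M_46L, M_84V, M_46L&84V. Rows are hypothetical strains.
factors = [
    [1.0, 0.0, 0.0],   # 46L only
    [0.0, 1.0, 0.0],   # 84V only
    [1.0, 1.0, 1.0],   # both present
    [0.5, 0.0, 0.0],   # 46L in a mixture with wild type
    [0.0, 0.0, 0.0],   # neither
]
measured_p_fr = [-0.19, -0.46, -0.74, -0.095, 0.0]
betas = fit_coefficients(factors, measured_p_fr)
print([round(b, 2) for b in betas])
```

With real data the measurements carry error, and highly correlated mutations make the normal equations ill-conditioned, which is the motivation for the correlation-removal steps described in the text.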

The problem of unwanted correlations has been discussed above. Unwanted
correlations are removed according to the methods of the invention.

One way to do this is to use an algorithm that has been developed by the inventors to
track the change in pFR as the effects of individual mutations or combinations of
mutations are removed from the dataset. The effect of each mutation or combination of
mutations is thus separated out. The methodology follows mutation trajectories
towards the global average as the effects of individual mutations or combinations of
mutations are removed. The steps are as follows:

a) calculate the average pFR for all mutations with a sufficient count in the database to be
significant;

b) determine the extremes (maximum, minimum), and select the mutation with the
pFR farthest away from the global average;

c) remove all virus strains that have the selected mutation and reiterate from step a);

d) stop when the selected mutation in step b) has an average pFR that approximates to
the global average.

In this manner, mutations that do not cause resistance, but which are often present with
mutations that do cause resistance, will have a higher average pFR (less resistance).
Removing the virus strains with a certain resistance-causing mutation results in an
increase of the average pFR for correlating mutations.

A suitable threshold at which a count in the database becomes sufficiently significant
will be apparent to the skilled reader and will be dependent on the database size. For
example, thresholds of 5, 10, 15, 20, 25, 30 or more may be suitable. In the examples
discussed herein, a threshold of 20 was used.

By an "average pFR that approximates to the global average" is meant that the average
pFR is within a fraction of the standard deviation of the remaining population. A
convenient fraction ranges between about 0.3 and 0.5.
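
The mutation trajectories steps a) to d), together with the count threshold and stopping fraction just described, might be sketched in Python as follows. This is an illustrative reading, not the inventors' code: the function name, the data layout (a list of `(set_of_mutations, pFR)` pairs), the use of the population standard deviation and the toy dataset are all assumptions.

```python
from statistics import mean, pstdev


def mutation_trajectories(strains, min_count=20, fraction=0.4):
    """Return [(mutation, average pFR when selected)] in order of selection."""
    remaining = list(strains)
    selected = []
    while remaining:
        values = [p for _, p in remaining]
        global_avg, spread = mean(values), pstdev(values)
        # Step a): average pFR per mutation with a sufficient count.
        by_mutation = {}
        for mutations, p in remaining:
            for m in mutations:
                by_mutation.setdefault(m, []).append(p)
        averages = {m: mean(v) for m, v in by_mutation.items() if len(v) >= min_count}
        if not averages:
            break
        # Step b): mutation whose average lies farthest from the global average.
        best = max(averages, key=lambda m: abs(averages[m] - global_avg))
        # Step d): stop once that average approximates the global average.
        if abs(averages[best] - global_avg) <= fraction * spread:
            break
        selected.append((best, averages[best]))
        # Step c): remove all strains carrying the selected mutation, then repeat.
        remaining = [(muts, p) for muts, p in remaining if best not in muts]
    return selected


# Toy data: 48V causes resistance; 71I merely tends to co-occur with it.
toy = ([({"48V", "71I"}, -1.5)] * 3) + ([({"71I"}, 0.0)] * 3) + ([(set(), 0.0)] * 4)
result = mutation_trajectories(toy, min_count=2)
print(result)
```

In the toy data the average pFR of 71I is -0.75 before 48V strains are removed and 0.0 afterwards, so 71I is never selected; only 48V is reported.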

A comparison of the change in the global average pFR with the change in the average
pFR for selected mutations with increasing iterations of the algorithm is shown in
Figure 7. Figure 8 shows an example, where the average pFR for 71I (unwanted
correlation) jumps up as a result of removing from the dataset virus strains that have
"71I & 84V" and "48V" mutations.

An alternative, analogous methodology for removing unwanted correlations is as
follows; this is an extension of the mutation trajectories algorithm discussed above.
The steps of this method are as follows:

a) calculate the correlation coefficient between all mutations (with a sufficient count in
the database) and the pFR;

b) determine the extremes (maximum, minimum), and select the mutation with the
highest (absolute value of) correlation coefficient;

c) calculate a linear model for the pFR with the selected mutation(s) (from step b), all
previous iterations);

d) take the residue (pFR minus the predicted value from the model);

e) calculate the correlation coefficient between all mutations (with a sufficient count in
the database) and the residue;

f) determine the extremes (maximum, minimum), and select the mutation with the
highest (absolute value of) correlation coefficient;

g) calculate a linear model for the pFR with the selected mutation(s) (from step f), all
previous iterations); and

h) reiterate from step d);

i) stop when the selected mutation in step g) has a correlation coefficient that
approximates to zero.

As with the mutation trajectories algorithm described above, the effect of mutations
that do not themselves cause resistance, but which are often present with mutations that
do cause resistance, is excluded and thus does not distort the real values.
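
The correlation-based steps a) to i) can be sketched in Python along the following lines. The sketch is illustrative and rests on several assumptions not stated in the text: Pearson correlation is used, the linear model includes an intercept and is fitted by least squares, "approximates to zero" is implemented as an absolute correlation below a hypothetical `tolerance`, a near-constant residue is treated as uncorrelated, and the count threshold is omitted for brevity.

```python
from statistics import mean


def pearson(xs, ys):
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx < 1e-12 or syy < 1e-12:   # constant input: treat as uncorrelated
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def least_squares(columns, y):
    """Solve the normal equations for y ~ sum(b_k * column_k) by Gauss-Jordan elimination."""
    k = len(columns)
    aug = [[sum(a * b for a, b in zip(columns[i], columns[j])) for j in range(k)]
           + [sum(a * t for a, t in zip(columns[i], y))] for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(k):
            if r != c:
                f = aug[r][c] / aug[c][c]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[c])]
    return [aug[i][k] / aug[i][i] for i in range(k)]


def select_by_residual_correlation(factors, p_fr, tolerance=0.05):
    """factors: {mutation: [mutation factor per strain]}; returns mutations in selection order."""
    selected, residue = [], list(p_fr)
    while len(selected) < len(factors):
        # Steps a)/e): correlate every unselected mutation with the current residue.
        corr = {m: pearson(col, residue) for m, col in factors.items() if m not in selected}
        # Steps b)/f): highest absolute correlation; step i): stop near zero.
        best = max(corr, key=lambda m: abs(corr[m]))
        if abs(corr[best]) <= tolerance:
            break
        selected.append(best)
        # Steps c)/g): linear model for pFR with all mutations selected so far.
        columns = [[1.0] * len(p_fr)] + [factors[m] for m in selected]
        beta = least_squares(columns, p_fr)
        # Step d): residue = pFR minus the model prediction.
        residue = [y - sum(b * col[i] for b, col in zip(beta, columns))
                   for i, y in enumerate(p_fr)]
    return selected


toy_factors = {"48V": [1, 1, 1, 0, 0, 0, 0, 0], "71I": [1, 1, 0, 1, 0, 0, 0, 0]}
toy_p_fr = [-1.5, -1.5, -1.5, 0.0, 0.0, 0.0, 0.0, 0.0]
chosen = select_by_residual_correlation(toy_factors, toy_p_fr)
print(chosen)
```

In the toy data 71I correlates with pFR only through its co-occurrence with 48V; once 48V is in the model the residue is flat and 71I is not selected.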

In a more preferred methodology for removing unwanted correlations, a stepwise
selection regression may be applied, which method selects the variable with the highest
effect. The steps of this method are as follows:

a) perform a first order regression from the list of mutations that occur in the dataset;

b) calculate the p-value for all mutations;

c) select the mutation with the lowest p-value and add it to the model;

d) re-calculate the regression model;

e) reiterate from step b);

f) stop when the re-calculation of the p-values of step b) gives no significant values
anymore.

Usually, this methodology is run in statistical software packages, which iteratively
model the residue from the previous regression as the dependent variable.

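$$\begin{aligned}
\mathrm{pFR} &= \mathrm{Intercept}_1 + \beta_1 M_1 + e_1 \\
e_1 &= \mathrm{Intercept}_2 + \beta_2 M_2 + e_2 \\
e_2 &= \mathrm{Intercept}_3 + \beta_3 M_3 + e_3 \\
&\;\;\vdots
\end{aligned}$$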

The p-value for a given mutation is the probability of rejecting a true null hypothesis,
where the null hypothesis is defined as follows: the coefficient of that parameter equals
zero. In other words, the p-value is the probability that the real coefficient for a certain
parameter is zero, while the model predicts a coefficient different from zero.

The expression "significant value" for the p-values refers to the mutations with a p-value
that is lower than the threshold selected by the user. This threshold may be
determined as follows: the user shall create linear models for a whole range of
combinations of p-values (a p-value for the first-order iteration and a p-value for the
second-order iteration). For each combination, the mean squared error (MSE) of the
corresponding model is calculated on unseen data. The combination of p-values that
results in the model with the lowest MSE, i.e. the combination for which the model
gives the best predictions, is chosen.
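The stepwise procedure described above can be sketched as follows. This is a minimal illustration and not the statistical package used by the inventors: it performs forward selection only (no removal of terms that lose significance), fits one variable at a time to the residual of the previous step, and approximates the p-value with a normal distribution, which assumes a large number of measurements. The function names, the toy data and the mutation labels M1 to M3 are hypothetical.

```python
import math


def simple_fit(x, y):
    """Fit y = a + b*x by least squares; return (a, b, approximate p-value of b)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0.0:
        return mean_y, 0.0, 1.0  # constant column carries no information
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    std_err = math.sqrt(sse / (n - 2) / sxx)
    if std_err == 0.0:
        return intercept, slope, (0.0 if slope != 0.0 else 1.0)
    # Two-sided p-value, normal approximation (assumes many measurements).
    return intercept, slope, math.erfc(abs(slope) / std_err / math.sqrt(2.0))


def stepwise_select(columns, pfr, p_entry=0.001):
    """columns: mutation -> 0/1 indicators per strain; pfr: measured pFR values."""
    residual = list(pfr)
    model = []  # (mutation, coefficient) in order of entry
    remaining = sorted(columns)
    while remaining:
        fits = {m: simple_fit(columns[m], residual) for m in remaining}
        best = min(remaining, key=lambda m: fits[m][2])
        intercept, slope, p_value = fits[best]
        if p_value >= p_entry:
            break  # no significant values anymore
        model.append((best, slope))
        # The residual becomes the dependent variable of the next iteration.
        residual = [r - intercept - slope * x
                    for r, x in zip(residual, columns[best])]
        remaining.remove(best)
    return model


# Synthetic, balanced toy data: M1 and M2 shift pFR, M3 has no effect.
rows = [(a, b, c, noise)
        for a in (0, 1) for b in (0, 1) for c in (0, 1)
        for noise in (0.05, -0.05) for _ in range(5)]
columns = {"M1": [r[0] for r in rows],
           "M2": [r[1] for r in rows],
           "M3": [r[2] for r in rows]}
pfr = [-0.8 * a - 0.4 * b + noise for a, b, c, noise in rows]
model = stepwise_select(columns, pfr)
```

On this toy dataset the sketch enters M1 first, then M2, and stops before M3. Because the toy design is balanced, fitting the residual one variable at a time gives the same coefficients as a joint regression; on real, correlated data the two would differ.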

In further preferred embodiments of the invention, problems of small datasets for
particular mutations or combinations of mutations are dealt with by applying the
method recursively to the set of virus strains that exhibit those particular mutations or
combinations of mutations.

In still further preferred embodiments of the invention, the following additional
correlations are taken into account:

• multiple entries of the same virus strain (or virus strains grown from the same
stock solution) that cause unwanted correlations;

• censored values in the genotype / phenotype database (for example, EC50 value =
'< 1 µM'). These are phenotypes beyond the assay range; thus, when the
phenotypic value is smaller than the measurable range, a '<'-censor is applied to
that value. Analogously, a '>'-censor is applied to the value if it is higher than the
measurable range.

Preferably, censored values are dealt with by attempting to construct a model that is
consistent from extrapolations. Censored values are thus modeled by replacing the
censored value by a maximum likelihood estimation, assuming knowledge of the
standard deviation of the measurement error.

A preferred technique for the generation of a maximum likelihood estimation is as
follows:

a) calculate a linear regression model without censored values;

b) use the phenotypic measured value V0 as if the censor were '=', e.g. when a result
is expressed as -log FR < 4, we will treat V0 as -log FR = 4;

c) look at the prediction P from the model and apply either:

Case '<'-censor:

■ P < V0 - 0.798 σ (center of gravity of half Gaussian distribution)
o Remove value from training data for next iteration

■ V0 - 0.798 σ < P < V0
o Use V' = V0 - 0.798 σ for next iteration

■ V0 < P
o Use V', the centre of gravity of the tail (< V0) of a normal distribution N(P, σ), as
value for next iteration.

Case '>'-censor:

■ P > V0 + 0.798 σ (center of gravity of half Gaussian distribution)
o Remove value from training data for next iteration

■ V0 + 0.798 σ > P > V0
o Use V' = V0 + 0.798 σ for next iteration

■ V0 > P
o Use V', the centre of gravity of the tail (> V0) of a normal distribution N(P, σ), as
value for next iteration.

d) calculate a linear regression model and, for the censored values in the linear
regression model, either remove the data-point from the training set or use V'
instead of the censored phenotype measurement, as described in step c);

e) re-iterate from step b) until the prediction converges.

Accordingly, for each iteration, when the prediction and measurement contradict,
censored values are taken into account. When the prediction and measurement are
strongly consistent, censored values are disregarded, on the basis that no further
information is provided and their inclusion has no additional value.
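Step c) of the technique above can be sketched as a single substitution function. This is an illustrative sketch under stated assumptions: the function and variable names are hypothetical, the '>'-censor case is treated as the mirror image of the '<'-censor case, and the tail value is computed as the mean of a truncated normal distribution. The constant 0.798 is the mean of a half Gaussian with unit standard deviation, sqrt(2/π).

```python
import math

HALF_GAUSSIAN_MEAN = math.sqrt(2.0 / math.pi)  # about 0.798


def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def substitute_censored(v0, censor, prediction, sigma):
    """Return the value V' to use in the next iteration, or None to drop the point.

    v0: censoring limit; censor: '<' or '>'; prediction: model prediction P;
    sigma: assumed standard deviation of the measurement error.
    """
    if censor == '>':
        # Mirror the problem so that only the '<' case needs to be handled.
        mirrored = substitute_censored(-v0, '<', -prediction, sigma)
        return None if mirrored is None else -mirrored
    offset = HALF_GAUSSIAN_MEAN * sigma
    if prediction < v0 - offset:
        return None             # strongly consistent: remove from training data
    if prediction < v0:
        return v0 - offset      # centre of gravity of the half Gaussian
    # Prediction contradicts the censor: mean of the tail (< v0) of N(P, sigma).
    z = (v0 - prediction) / sigma
    return prediction - sigma * normal_pdf(z) / normal_cdf(z)
```

For example, `substitute_censored(4.0, '<', 2.0, 1.0)` drops the point, while a prediction of 5.0 against the same censor yields a substitute value below 4.0. At P = V0 the tail formula reduces to V0 - 0.798 σ, so the three branches join continuously.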

In one preferred embodiment of this aspect of the invention, the number of calculations
necessary in the linear regression analysis may be reduced. The computational power
and memory that is currently generally available is insufficient to allow a
full second order model to be evaluated for a large dataset, based on all possible single
mutations and second order terms, since the number of terms increases quadratically
with the number of mutations considered. This number increases with a larger dataset
since more rare mutations are present in a large database.

In order to reduce the number of terms, a first order regression may be performed from
the list of mutations that occur in the dataset above a threshold number of times. A
suitable threshold at which a count in the database becomes sufficiently significant will
be apparent to the skilled reader and will be dependent on the database size. For
example, thresholds of 5, 10, 15, 20, 25, 30 or more may be suitable. In the examples
discussed herein, a threshold of 20 times was used. The significant terms from this first
order regression are retained and the list of these terms is then used to perform a
second order regression. In the second order regression, only the single mutations and
combinations of mutations that were found significant in the first order model are used.
Again, a threshold significance will be apparent to the skilled reader; an example is that
the probability that the real value of the term is 0 is smaller than 0.001.

For example, a first order regression performed on the matching genotype / phenotype
dataset for indinavir (34,445 measurements) for those mutations that occur at least 20
times results in a first order model that retains a list of 94 single mutations that are
considered significant.

This list is then used as a starting list for a second order regression. It should be noted
that it may be advantageous to exclude certain very common mutations from the
calculation. 3I is one example. The reason is that a mutation must occur at least a
threshold number of times and the inverse also has to be true: the count of viruses not
having the mutation 3I, or the couple "not 3I and another mutation", should also be above
the threshold value (e.g. 20). Taking this into account results in excluding 3I from the
regression in practice.

In the second order regression, all the single mutations and all couples of mutations
from the list are used as potential terms. The significant terms are then retained by the
regression algorithm.
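The count-based filtering of candidate terms can be sketched as follows. This is a simplified illustration, not the inventors' implementation: it applies only the occurrence threshold and its inverse (enough strains with the term and enough strains without it), and it omits the intermediate significance test of the first order regression. The function name and the toy genotypes are hypothetical.

```python
from collections import Counter
from itertools import combinations


def candidate_terms(genotypes, threshold=20):
    """genotypes: list of sets of mutations, one set per measurement."""
    total = len(genotypes)

    def frequent(count):
        # Enough strains carrying the term and enough strains lacking it.
        return count >= threshold and total - count >= threshold

    single_counts = Counter(m for g in genotypes for m in g)
    singles = sorted(m for m, n in single_counts.items() if frequent(n))
    # Couples are only formed from single mutations that passed the filter.
    pair_counts = Counter(
        pair for g in genotypes
        for pair in combinations(sorted(g.intersection(singles)), 2)
    )
    pairs = sorted(p for p, n in pair_counts.items() if frequent(n))
    return singles, pairs


# Toy data: 3I is present in every strain, so its inverse count is zero.
genotypes = ([{"3I", "84V", "82A"}] * 25 + [{"3I", "84V"}] * 20
             + [{"3I", "82A"}] * 20 + [{"3I", "10I"}] * 5 + [{"3I"}] * 10)
singles, pairs = candidate_terms(genotypes)
```

In this toy dataset 3I is dropped because no strain lacks it, and 10I is dropped because it occurs only 5 times, leaving 82A, 84V and their couple as candidate terms.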

According to a further aspect of the invention, there is provided a method of calculating
the quantitative contribution of a mutation pattern to the drug resistance phenotype
exhibited by an HIV strain, said method comprising the steps of:

a) obtaining a genetic sequence of said HIV strain;

b) identifying the pattern of mutations in said genetic sequence, wherein said mutations
are associated with resistance or susceptibility to drug therapy; and

c) calculating the fold resistance of the HIV strain as compared to the wild type HIV
strain by performing a linear regression analysis, whereby the log fold resistance, pFR,
is modelled as the sum of all the individual resistance contributions for each of the
mutations or combinations of mutations that occur in said HIV strain according to the
following equation:

$$pFR = \beta_A M_A + \beta_B M_B + \ldots + \beta_Z M_Z + \varepsilon$$

wherein each individual resistance contribution is calculated by multiplying a mutation
factor, $M_A, M_B, \ldots, M_Z$, for each mutation or combination of mutations by a
resistance coefficient, $\beta_A, \beta_B, \ldots, \beta_Z$;

wherein the mutation factor assigned reflects the degree to which the mutation or
combination of mutations is present in the HIV strain and, if present, to which degree
the mutation is present in a mixture;

wherein each resistance coefficient reflects the contribution of the mutation or
combination of mutations to the fold resistance exhibited by the strain;

and wherein the error term, $\varepsilon$, represents the difference between the modelled prediction
and the experimentally determined measurement.
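The role of the mutation factor can be illustrated with a short sketch. The function name is hypothetical, and encoding a mutation detected as part of a mixture with a factor of 0.5 is an assumption made here for illustration only; the description does not fix a particular numerical encoding. The coefficient values are arbitrary example numbers.

```python
def predict_pfr(coefficients, mutation_factors):
    """Sum of coefficient * mutation factor over all terms of the model.

    coefficients: term -> resistance coefficient (beta).
    mutation_factors: term -> factor between 0 and 1; absent terms count as 0.
    """
    return sum(beta * mutation_factors.get(term, 0.0)
               for term, beta in coefficients.items())


# Example coefficients (arbitrary) and a strain in which 84V is fully present
# while 90M is assumed to be present as a 50/50 mixture with wild type.
coefficients = {"84V": -0.46, "90M": -0.16, "50V": -0.92}
factors = {"84V": 1.0, "90M": 0.5}
predicted = predict_pfr(coefficients, factors)  # -0.46 + 0.5 * -0.16
```

The error term of the equation does not appear in the prediction; it is the difference between this predicted value and the measured pFR.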

As the skilled reader will appreciate, the fold resistance of the HIV strain may be
calculated using any one of the embodiments of the invention referred to above.
In the first step of this method, the genetic sequence of an HIV strain should be
obtained. Normally, this will be the genetic sequence of an HIV strain with which a
patient is infected, although the sequence may be a theoretical sequence, for example
for purposes of in silico modelling.

The method may thus be used as a diagnostic method for predicting the fold resistance
exhibited by a particular HIV strain with which a patient is infected. According to
other preferred embodiments, the method may be used for assessing the efficiency of a
patient's therapy or for evaluating or optimising a therapy. The method may be
performed for each drug or combination of drugs currently being administered to the
patient so as to obtain a series of drug resistance phenotypes and thus to assess the
effect of a plurality of drugs or drug combinations on the predicted fold resistance
exhibited by the HIV strain with which the patient is infected.

A "patient" may be any organism, particularly a human or other mammal, suffering
from HIV or AIDS or in need or desire of treatment for such disease. A patient
includes any mammal and particularly humans of any age or state of development.

To obtain an HIV strain from a patient, a biological sample will need to be obtained
from the patient. A "biological sample" may be any material obtained in a direct or
indirect way from a patient containing HIV virus. A biological sample may be
obtained from, for example, saliva, semen, breast milk, blood, plasma, faeces, urine,
tissue samples, mucous samples, cells in cell culture, cells which may be further
cultured, etc. Biological samples also include biopsy samples.

The genetic sequence of an HIV strain may be evaluated by a number of suitable
means, as will be clear to those of skill in the art. Most suitable will be techniques that
allow for specific nucleic acid amplification, such as the polymerase chain reaction
(PCR), although other techniques such as restriction fragment length polymorphism
(RFLP) analysis will be equally applicable.

Nucleic acid sequencing then allows the analysis of the mutation pattern in a particular
nucleic acid sequence, either by classical nucleic acid sequencing protocols, e.g. extension
chain termination protocols (Sanger technique; see Sanger F., Nicklen S., Coulson A.
Proc. Nat. Acad. Sci. 1977, 74, 5463-5467) or chain cleavage protocols. Such methods
may employ such enzymes as the Klenow fragment of DNA polymerase I, Sequenase
(US Biochemical Corp., Cleveland, OH), Taq polymerase (Perkin Elmer), thermostable
T7 polymerase (Amersham, Chicago, IL), or combinations of polymerases and proof-reading
exonucleases such as those found in the ELONGASE Amplification System
marketed by Gibco/BRL (Gaithersburg, MD). Preferably, the sequencing process may
be automated using machines such as the Hamilton Micro Lab 2200 (Hamilton, Reno,
NV), the Peltier Thermal Cycler (PTC200; MJ Research, Watertown, MA) and the ABI
Catalyst and 373 and 377 DNA Sequencers (Perkin Elmer). Particular sequencing
methodologies have been developed further by companies such as Visible Genetics.
Any of the novel approaches developed for unraveling the sequence of a target nucleic
acid, either now or in the future, will be perfectly applicable to the analysis of sequence
in the present invention (including but not limited to mass spectrometry, MALDI-TOF
(matrix assisted laser desorption ionization time of flight spectroscopy) (see Graber J.,
Smith C., Cantor C. Genet. Anal. 1999, 14, 215-219), chip analysis (hybridization based
techniques) (Fodor S P; Rava R P; Huang X C; Pease A C; Holmes C P; Adams C L.
Nature 1993, 364, 555-6)). It should be appreciated that nucleic acid sequencing covers
both DNA and RNA sequencing.

Once the genetic sequence of the HIV strain is known, the pattern of mutation must be
identified in the sequence. The term "mutation", as this is used herein, encompasses
both genetic and epigenetic mutations of the genetic sequence of wild type HIV. A
genetic mutation includes, but is not limited to, (i) base substitutions: single nucleotide
polymorphisms, transitions, transversions, substitutions and (ii) frame shift mutations:
insertions, repeats and deletions. Epigenetic mutations include, but are not limited to,
alterations of nucleic acids, e.g., methylation of nucleic acids. One example includes
(changes in) methylation of cytosine residues in the whole or only part of the genetic
sequence. In the present invention, mutations will generally be considered at the level
of the amino acid sequence, and comprise, but are not limited to, substitutions,
deletions or insertions of amino acids.

The "control sequence" or "wild type" is the reference sequence from which the
existence of mutations is based. A control sequence for HIV is HXB2. This viral
genome comprises 9718 bp and has an accession number in Genbank at NCBI M38432
or K03455 (gi number: 327742).

Identifying a mutation pattern in a genetic sequence under test thus relates to the
identification of mutations in the genetic sequence as compared to a wild type
sequence, which lead to a change in nucleic acids or amino acids or which lead to
altered expression of the genetic sequence or altered expression of the protein encoded
by the genetic sequence or altered expression of the protein under control of said
genetic sequence.

A "mutation pattern" comprises at least one mutation influencing sensitivity of HIV to
a therapy. As such, a mutation pattern may consist of only one single mutation.
Alternatively, a mutation pattern may consist of at least two, at least three, at least four,
at least five, at least six, at least seven, at least eight, at least nine or at least ten or more
mutations. A mutation pattern is thus a list or combination of mutations or a list of
combinations of mutations. A mutation pattern of any particular genetic sequence may
be constructed, for example, by comparing the tested genetic sequence against a wild
type or control sequence. The existence of a mutation or the existence of one of a
group of mutations can then be noted.

One way in which this may be done is by aligning the genetic sequence under test to a
wild type sequence, noting any differences in the alignment. Typical alignment
methods include Smith-Waterman (Smith and Waterman, (1981) J Mol Biol, 147: 195-197),
Blast (Altschul et al. (1990) J Mol Biol., 215(3): 403-10), FASTA (Pearson &
Lipman, (1988) Proc Natl Acad Sci USA; 85(8): 2444-8) and, more recently, PSI-BLAST
(Altschul et al. (1997) Nucleic Acids Res., 25(17): 3389-402). It may in some
circumstances be preferable to generate alignments using a multiple alignment
program, such as ClustalW (Thompson et al., 1994, NAR, 22(22), 4673-4680). Other
suitable methods will be clear to those of skill in the art (see also "Bioinformatics: A
practical guide to the analysis of genes and proteins", Eds. Baxevanis and Ouellette,
1998, John Wiley and Sons, New York). A practical example of multiple sequence
alignment is the construction of a phylogenetic tree. A phylogenetic tree visualizes the
relationship between different sequences and can be used to predict future events and
retrospectively to devise a common origin. This type of analysis can be used to predict
a similar drug sensitivity for a sample but also can be used to unravel the origin of
different patient samples (i.e. the origin of the viral strain).
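Once an alignment is available, noting the differences at amino acid level is straightforward. The following is a minimal sketch that assumes the two sequences have already been aligned to the same length by one of the methods cited above; it does not perform the alignment itself and does not renumber positions after insertions. The function name and the two toy sequences are hypothetical and are not taken from HXB2.

```python
def list_mutations(wild_type, sample):
    """Compare two pre-aligned amino acid sequences of equal length.

    Returns each difference as '<position><sample residue>' (for example
    '84V'), with positions counted from 1 along the wild type sequence.
    """
    if len(wild_type) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [f"{position}{residue}"
            for position, (reference, residue)
            in enumerate(zip(wild_type, sample), start=1)
            if reference != residue]


# Toy sequences for illustration only.
mutations = list_mutations("MKVLAT", "MKILAS")
```

The resulting list uses the same position-plus-residue notation as the mutation names in the examples below.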

In this manner, therefore, the pattern of mutations in the genetic sequence can be
identified, wherein said mutations are associated with resistance or susceptibility to
drug therapy exhibited by the HIV strain tested. The mutation pattern may influence
sensitivity to a specific therapy, e.g., a drug, or a group of therapies. The mutation
pattern may, for example, increase and/or decrease resistance of the HIV strain to a
therapy. Particular mutations in the mutation pattern may also, for example, enhance
and/or decrease the influence of other mutations present in the genetic sequence that
affect sensitivity of the HIV strain to a therapy.

The invention further relates to a diagnostic system as herein described for use in any
of the above described methods. An example of such a diagnostic system, for
quantitating the individual contribution of a mutation or combination of mutations to
the drug resistance phenotype exhibited by an HIV strain, comprises:

a) means for obtaining a genetic sequence of said HIV strain;

b) means for identifying the mutation pattern in said genetic sequence as compared to
wild type HIV;

c) means for predicting the fold resistance exhibited by the HIV strain using any one of
the methods described above.

The means for predicting the fold resistance are preferably computer means.

A still further aspect of the invention relates to a computer apparatus or computer-based
system adapted to perform any one of the methods of the invention described above, for
example, to quantify the individual contribution of a mutation or combination of
mutations to the drug resistance phenotype exhibited by HIV, or to calculate the
quantitative contribution of a mutation pattern to the drug resistance phenotype
exhibited by an HIV strain.

In a preferred embodiment of the invention, said computer apparatus may comprise a
processor means incorporating a memory means adapted for storing data; means for
inputting data relating to the mutation pattern exhibited by a particular HIV strain; and
computer software means stored in said computer memory that is adapted to perform a
method according to any one of the embodiments of the invention described above and
output a predicted quantified drug resistance phenotype exhibited by an HIV strain
possessing said mutation pattern.

A computer system of this aspect of the invention may comprise a central processing
unit; an input device for inputting requests; an output device; a memory; and at least
one bus connecting the central processing unit, the memory, the input device and the
output device. The memory should store a module that is configured so that upon
receiving a request to quantify the individual contribution of a mutation or combination
of mutations to the drug resistance phenotype exhibited by HIV, or to calculate the
quantitative contribution of a mutation pattern to the drug resistance phenotype
exhibited by an HIV strain, it performs the steps listed in any one of the methods of the
invention described above.

In the apparatus and systems of these embodiments of the invention, data may be input
by downloading the sequence data from a local site such as a memory or disk drive, or
alternatively from a remote site accessed over a network such as the internet. The
sequences may be input by keyboard, if required.

The generated results may be output in any convenient format, for example, to a
printer, a word processing program, a graphics viewing program or to a screen display
device. Other convenient formats will be apparent to the skilled reader.

The means adapted to quantify the individual contribution of a mutation or combination
of mutations to the drug resistance phenotype exhibited by HIV, or to calculate the
quantitative contribution of a mutation pattern to the drug resistance phenotype
exhibited by an HIV strain will preferably comprise computer software means. As the
skilled reader will appreciate, once the novel and inventive teaching of the invention is
appreciated, any number of different computer software means may be designed to



According to a still further aspect of the invention, there is provided a computer
program product for use in conjunction with a computer, said computer program
comprising a computer readable storage medium and a computer program mechanism
embedded therein, the computer program mechanism comprising a module that is
configured so that upon receiving a request to quantify the individual contribution of a
mutation or combination of mutations to the drug resistance phenotype exhibited by
HIV, or to calculate the quantitative contribution of a mutation pattern to the drug
resistance phenotype exhibited by an HIV strain, it performs the steps listed in any one
of the methods of the invention described above.

The invention further relates to systems, computer program products, business
methods, server side and client side systems and methods for generating, providing, and
transmitting the results of the above methods.

The invention will now be described by way of example with particular reference to a
specific algorithm that implements the process of the invention. As the skilled reader
will appreciate, variations from this specific illustrated embodiment are of course
possible without departing from the scope of the invention.

BRIEF DESCRIPTION OF THE FIGURES

Figure 1: Overview of measured/predicted phenotype handling

Figure 2: Phenotype distribution of ritonavir for matching G/P samples in Virco
database

Figure 3: pFR distribution for saquinavir

Figure 4: Distribution of pFR for '48V' mutation on saquinavir

Figure 5: Distribution of pFR for '48V' mutation on saquinavir (expanded)

Figure 6: Removing unwanted correlations

Figure 7: Global mutation trajectories

Figure 8: Mutation trajectories for 71I

Figure 9: Example of genotypes, mutations relative to HXB2

Figure 10: Example of phenotype analysis for ritonavir

Figure 11: Higher order interaction between mutations 82A and 84V

Figure 12: Illustration of iterative procedure for censored values

Figure 13: Linear regression model identifies mutations included in IAS list. Mutations
marked with an * are also identified by a regression on a 5% subset of the data

Figure 14: Linear regression model identifies additional mutations previously described
in the literature

Figure 15: Predicted versus measured log(FC)

Figure 16: Comparison between linear regression model and decision trees

Figure 17: Histogram of population left after removing all virus strains during the
iterations

Figure 18: Trajectory of mutation 18H

Figure 19: Residues as a function of the measured values

Figure 20: Histogram of the residues as a function of the measured values

Figure 21: Histogram of the residues as a function of the measured values after 6
parameters were taken into account

EXAMPLES

Example 1: Methodology

1.1 Introduction

This exercise involved the generation of a list of key mutations for each of the
following drugs: Indinavir, Ritonavir, Saquinavir, Nelfinavir, Amprenavir, Lopinavir,
Zidovudine, Didanosine, Zalcitabine, Stavudine, Abacavir, Lamivudine, Tenofovir,
Nevirapine, Delavirdine and Efavirenz.

The obtained list of key mutations is derived from a linear regression model using
single mutations and couples of mutations as independent variables. The dataset used
for this analysis is an export of the Virco dataset at 2003/02/01 from the vircomining
tables. Table 1 shows the matching geno/pheno counts for each drug (each phenotype
measurement for a genotype counts as one measurement).

Table 1: matching geno/pheno counts

Drug          Count     Drug          Count     Drug          Count
Amprenavir    29,508    Lamivudine    34,395    Delavirdine   32,450
Indinavir     34,445    Abacavir      32,744    Efavirenz     32,601
Lopinavir      7,410    Stavudine     34,420    Nevirapine    34,738
Nelfinavir    34,470    Zalcitabine   34,539
Ritonavir     34,502    Didanosine    34,227
Saquinavir    34,543    Tenofovir     14,591
                        Zidovudine    33,575


1.2 Dataset

The dataset used consists of a set of matching genotype/phenotype measurements with
possible multiple phenotype measurements per genotype. The mutations are defined
relative to HXB2 at amino acid level. The phenotypes are presented as pFR values,
which are equal to -log(FR), where FR denotes the Fold Resistance.
Negative pFR values denote resistance and positive values denote hyper-susceptibility.
For example, a pFR value of -1.0 is equal to 10-fold resistance.

1.3 Linear regression

For example, consider the following (artificial) model:

Table 2

Mutation      Coefficient
84V           -0.46
50V           -0.92
54M           -0.64
88S            0.63
90M           -0.16
46L           -0.19
46L&84V       -0.09


Consider a virus with the following mutations: 3I, 46L, 84V and 90M. This virus will have
a pFR prediction:

$$pFR = \beta_{46L} \cdot 1 + \beta_{84V} \cdot 1 + \beta_{90M} \cdot 1 + \beta_{46L\&84V} \cdot 1 = -0.90$$

or almost 8-fold resistance. Note that in the model 46L and 84V are synergistic, since
their co-occurrence decreases the pFR by an extra -0.09. Note also that the mutation 3I
has no resistance coefficient assigned in the model and is therefore not considered
in the pFR prediction equation.
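The arithmetic of this worked example can be reproduced with a few lines of code. This is only an illustrative sketch: the coefficients are those of the artificial model in Table 2, each term is treated as simply present or absent (mutation factor 1 or 0), and the function name is hypothetical.

```python
# Coefficients of the artificial model of Table 2; a tuple with two entries
# denotes a couple of mutations that must both be present.
COEFFICIENTS = {
    ("84V",): -0.46, ("50V",): -0.92, ("54M",): -0.64, ("88S",): 0.63,
    ("90M",): -0.16, ("46L",): -0.19, ("46L", "84V"): -0.09,
}


def predict_pfr(mutations):
    """Sum the coefficients of every model term fully present in the virus."""
    present = set(mutations)
    return sum(beta for term, beta in COEFFICIENTS.items()
               if present.issuperset(term))


# The mutation at position 3 has no coefficient and contributes nothing.
pfr = predict_pfr(["3I", "46L", "84V", "90M"])  # -0.46 - 0.16 - 0.19 - 0.09
fold_resistance = 10 ** (-pfr)                  # pFR = -log10(FR)
```

The four contributing terms sum to a pFR of -0.90, which corresponds to a fold resistance of about 7.9, i.e. almost 8-fold.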

Figure 9 shows an example of four different genotypes (mutations relative to HXB2),
whilst Figure 10 shows an example of phenotype analysis for RTV performed
according to the method of the invention.

1.4 Model creation

Using our facilities, it was computationally infeasible to calculate a full second order
model on all possible mutations and second order terms, since the number of terms
increases quadratically with the number of mutations considered.
E.g. for APV:

• Total number of occurring mutations and couples of mutations: 19,074

• Mutations and couples with each at least 20 measurements: 4,107

In order to reduce the number of terms, a first order regression was performed from the
list of mutations that occur at least 20 times in the dataset. The significant terms from
this regression were retained and the list of these terms (except mutation 3I for some of
the PIs) was used to perform a second order regression. A term is called significant if
the probability that the real value of the term is 0 is smaller than 0.001. A mutation
must occur at least 20 times and the inverse also has to be true: the count of viruses not
having the mutation 3I, or the couple "not 3I and another mutation", should also be at least
20. Taking this into account results in excluding 3I from the regression in practice. In
the second order regression, only the single mutations and couples of mutations that
were significant in the first order model are used.

1.5 Example of model creation: indinavir

A first order regression is performed on the matching geno/pheno dataset (34,445
measurements, of which 28,480 unique Virco IDs) for those mutations that occur at
least 20 times. The resulting first order model retains a list of 94 single mutations
that are considered significant. This list (except 3I) is used as a starting list for a
second order regression. In this second order regression, all the single mutations and
all couples of mutations from the list are used as potential terms. The significant terms
are retained by the regression algorithm.

1.6 The impact of cross-drug correlation on the significance level of mutations

Correlation between mutations that cause resistance to different drugs has an impact on
the confidence of the coefficient of such a mutation. One of the effects is that for
non-nucleoside reverse transcriptase inhibitors (NNRTIs) and nucleoside reverse
transcriptase inhibitors (NRTIs), some mutations that are not relevant for that drug
appear as significant (though with a coefficient close to 0), because drug resistance to
the drug is correlated with drug resistance to drugs that bind at a different place.

Note that this is only a problem for interpretation of the model, not for prediction of the
Fold Resistance; the resulting model remains a good pFR predictor.

1.7 Effects of second order terms

• Antagonism

Table 3

Parameter     pFR shift   Count
82A & 84V      0.43         395
82A           -0.27        4845
84V           -0.26        3531

Second order terms can indicate a synergy or an antagonism. In the example above, the
occurrence of either 82A or 84V causes a resistance shift, but the co-occurrence of both
mutations almost completely cancels out the effect of both mutations and shifts to
susceptible ranges. In case both mutations are present, the net pFR shift is only -0.43,
while it is -0.26 or -0.27 if only one of the mutations is present. This is an example of
strong antagonism.

• Synergy 

Table 4

Parameter     pFR shift   Count
...
24I           -0.22        1022
24I&73S       -0.48          30
73S           -0.20        2216

In this example, 24I and 73S both cause a resistance shift, but their co-occurrence
causes a strong extra shift towards resistance. When only one of the mutations is
present, the pFR shift is -0.20 or -0.22, but the presence of both mutations causes a pFR
shift of -0.48. 24I and 73S are thus strongly synergistic in this example.
• Enhancement 
Table 5

Parameter     pFR shift   Count
...
32I            0            821
...
32I&82A       -0.26         516
82A           -0.27        4845

32I by itself does not contribute to resistance, but it increases the resistance for an 82A
mutation. 32I enhances the effect of the 82A mutation.
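The three kinds of second order effect can be told apart from the signs of the fitted coefficients. The rule below is an assumption made for illustration, not a definition taken from the description: a couple term is labelled an enhancement when one of the single mutations has no effect of its own, a synergy when it pushes pFR in the same direction as the single mutations, and an antagonism when it pushes in the opposite direction. The function name and the tolerance are hypothetical.

```python
def classify_pair(beta_a, beta_b, beta_pair, tol=1e-9):
    """Label a second order term from the signs of the fitted coefficients."""
    if abs(beta_pair) <= tol:
        return "no interaction"
    if abs(beta_a) <= tol or abs(beta_b) <= tol:
        return "enhancement"
    main_effect = beta_a + beta_b
    same_direction = (beta_pair < 0) == (main_effect < 0)
    return "synergy" if same_direction else "antagonism"


# Coefficients taken from Tables 3, 4 and 5 respectively.
table3 = classify_pair(-0.27, -0.26, 0.43)   # 82A, 84V, 82A & 84V
table4 = classify_pair(-0.22, -0.20, -0.48)  # 24I, 73S, 24I&73S
table5 = classify_pair(0.0, -0.27, -0.26)    # 32I, 82A, 32I&82A
```

Applied to the three tables, the rule returns antagonism, synergy and enhancement, matching the labels used in the text.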

An example of the effects of higher order interactions is shown in Figure 11.

1.8 Highly correlated mutations

Highly correlated mutations (i.e. mutations that almost always co-occur in a strain) can
affect the results of a regression analysis. For example, when one of these 2 mutations
has an effect on resistance and the other mutation does not (this mutation might for
example be highly correlated with the resistance mutation because it affects the
replication rate of the virus), the effect can be assigned to either one of the mutations.
Unless this is compensated for, the regression model will assign the effect to that
mutation that reduces the prediction error the most, which might not always be the
mutation that is biologically responsible for the effect. Due to the correlation, it would
otherwise not be possible to distinguish between these mutations.

Another effect that occurs due to correlation is when a mutation is highly correlated
with a pair of mutations in which the first mutation is present.

Table 6

Parameter     pFR shift   Count
58N           -1.47         108
58N&77L        1.16         106
77L            0            471

In the above example, 108 samples have a 58N mutation and out of these, 106 samples
also have a 77L mutation. The effect of a pure 58N mutation can only be derived from
the samples that have 58N and do not have 77L, which leads to higher uncertainty on
the estimated pFR shift of the 58N mutation. The couple-term '58N & 77L' will
compensate for a too low estimation of 58N by having a too high estimation for its pFR
shift.
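Pairs of this kind can be flagged before the regression by checking how often one mutation is accompanied by another. The sketch below is an illustration under stated assumptions: the function name and the 0.95 cut-off are hypothetical, and the toy genotypes are reconstructed from the counts in Table 6 (106 strains with both mutations, 2 with 58N only, 365 with 77L only), ignoring any other mutations those strains may carry.

```python
from collections import Counter
from itertools import permutations


def highly_correlated(genotypes, min_fraction=0.95):
    """Return {(a, b): fraction of strains with a that also carry b}."""
    single = Counter(m for g in genotypes for m in g)
    joint = Counter(p for g in genotypes for p in permutations(sorted(g), 2))
    return {(a, b): n / single[a]
            for (a, b), n in joint.items()
            if n / single[a] >= min_fraction}


# Toy genotypes rebuilt from the counts of Table 6.
genotypes = [{"58N", "77L"}] * 106 + [{"58N"}] * 2 + [{"77L"}] * 365
flagged = highly_correlated(genotypes)
```

Here 58N is flagged as almost always accompanied by 77L (106 of 108 strains), whereas the reverse direction is not flagged, since only 106 of the 471 strains with 77L also carry 58N. Only two strains remain from which the effect of 58N alone can be estimated.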

Techniques are provided in the description to deal with these effects. The algorithm
developed by the inventors tracks the change in pFR as the effects of individual
mutations or combinations of mutations are removed from the dataset.
A comparison of the change in the global average pFR with the change in the average
pFR for selected mutations with increasing iterations of the algorithm is shown in
Figure 7. Figure 8 shows an example, where the average pFR for 71I (unwanted
correlation) jumps up as a result of removing from the dataset virus strains that have
'71I & 84V' and '48V' mutations.

Example 2: Illustration of a stepwise regression with amprenavir

A dataset of 31,292 matching genotypes and phenotypes for amprenavir was used. A
first order regression was performed, and the first 11 iterations are shown. The
inventors selected as p-values:

p-value entry = 0.001

p-value stay = 0.5

In the first iteration, the variable with the lowest p-value (84V) was selected and the
model was calculated with this variable. The model is shown in Table 7.

Variable P084_V Entered: R-Square = 0.2477 and C(p) = 43445.43

Table 7

Analysis of Variance
Source            DF      Sum of Squares   Mean Square   F Value   Pr > F
Model             1       2042.55324       2042.55324    10301.6   <.0001
Error             31289   6203.83650       0.19828
Corrected Total   31290   8246.38974

Variable    Parameter Estimate   Standard Error   Type II SS   F Value   Pr > F
Intercept   -0.00318             0.00269          0.27810      1.40      0.2363
P084_V      -0.81754             0.00805          2042.55324   10301.6   <.0001
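
The summary figures of such a table follow directly from the sums of squares and the
degrees of freedom. The sketch below recomputes them for Table 7 using the standard
ANOVA relations (R-Square = model SS / total SS, F = model mean square / error mean
square); the function name is hypothetical.

```python
def anova_summary(ss_model, df_model, ss_error, df_error):
    """Return R-square, mean squares and F value from ANOVA sums of squares."""
    ss_total = ss_model + ss_error
    ms_model = ss_model / df_model
    ms_error = ss_error / df_error
    return {
        "r_square": ss_model / ss_total,
        "ms_model": ms_model,
        "ms_error": ms_error,
        "f_value": ms_model / ms_error,
    }


# Sums of squares and degrees of freedom as reported in Table 7.
table7 = anova_summary(ss_model=2042.55324, df_model=1,
                       ss_error=6203.83650, df_error=31289)
print(round(table7["r_square"], 4))   # compare with the R-Square in the heading
print(round(table7["ms_error"], 5))   # compare with the Error mean square
print(round(table7["f_value"], 1))    # compare with the Model F value
```

The same relations can be used to cross-check the R-Square quoted in the heading of each
of the following tables.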









In the second iteration, the variable with the lowest p-value was P082_A. The model is
shown in Table 8 below.

Variable P082_A Entered: R-Square = 0.3538 and C(p) = 32908.89

Table 8

Analysis of Variance
Source            DF      Sum of Squares   Mean Square   F Value   Pr > F
Model             2       2917.39937       1458.69968    8564.44   <.0001
Error             31288   5328.99038       0.17032
Corrected Total   31290   8246.38974

Variable    Parameter Estimate   Standard Error   Type II SS   F Value   Pr > F
Intercept   0.07033              0.00269          116.28598    682.75    <.0001
P082_A      -0.47905             0.00668          874.84612    5136.47   <.0001
P084_V      -0.82827             0.00747          2095.67091   12304.3   <.0001



In the next iteration, we repeated the previous process except that the influence of 84V
and 82A is now removed. In this third run, the mutation with the lowest p-value was
P090_M.

It may happen that after some iterations, a mutation which was at first found significant
is no longer significant, and as such it is removed from the model.

The resistance coefficient is adjusted after every iteration because of the addition of a
new variable to the regression. E.g. for P084_V, the resistance coefficient β changes from
-0.81754 to -0.82827.

The following iterations were done in the same way. The results obtained are enclosed
below.

Variable P090_M Entered: R-Square = 0.4245 and C(p) = 25885.06

Table 9

Analysis of Variance
Source            DF      Sum of Squares   Mean Square   F Value   Pr > F
Model             3       3500.64130       1166.88043    7692.82   <.0001
Error             31287   4745.74844       0.15168
Corrected Total   31290   8246.38974

Variable    Parameter Estimate   Standard Error   Type II SS   F Value   Pr > F
Intercept   0.13650              0.00276          373.40964    2461.75   <.0001
P082_A      -0.40431             0.00642          601.20353    3963.52   <.0001
P084_V      -0.64830             0.00762          1097.72370   7236.89   <.0001
P090_M      -0.33549             0.00541          583.24194    3845.10   <.0001


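Variable P033_F Entered: R-Square = 0.4729 and C(p) = 21082.6[illegible]

Table 10

[The Analysis of Variance block and the Parameter Estimate and Standard Error columns
are illegible in the source.]

Variable    Type II SS   F Value             Pr > F
Intercept   398.56350    2868.58             <.0001
P033_F      398.83323    2870.52             <.0001
P082_A      460.87758    3317.[illegible]    <.0001
P084_V      954.28785    6868.28             <.0001
P090_M      483.38315    3479.05             <.0001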



Variable P046_I Entered: R-Square = 0.5110 and C(p) = 17296.94

Table 11

Analysis of Variance
Source            DF      Sum of Squares   Mean Square   F Value   Pr > F
Model             5       4213.90735       842.78147     6538.51   <.0001
Error             31285   4032.48240       0.12890
Corrected Total   31290   8246.38974

Variable    Parameter Estimate   Standard Error   Type II SS   F Value   Pr > F
Intercept   0.15801              0.00257          489.12701    3794.77   <.0001
P033_F      -0.59310             0.01084          386.19290    2996.18   <.0001
P046_I      -0.32298             0.00654          314.43281    2439.45   <.0001
P082_A      -0.32721             0.00601          381.95629    2963.31   <.0001
P084_V      -0.53666             0.00721          714.53640    5543.55   <.0001
P090_M      -0.25737             0.00511          326.53882    2533.37   <.0001



Variable P047_V Entered: R-Square = 0.5287 and C(p) = 15544.[illegible]

Table 12

Analysis of Variance
Source            DF      Sum of Squares   Mean Square   F Value   Pr > F
Model             6       4359.53390       726.58898     5848.07   <.0001
Error             31284   3886.85584       0.12424
Corrected Total   31290   8246.38974

Variable    Parameter Estimate   Standard Error   Type II SS   F Value   Pr > F
Intercept   0.15941              0.00252          497.68905    4005.73   <.0001
P033_F      -0.55617             0.01069          336.13339    2705.42   <.0001
P046_I      -0.27816             0.00655          223.90478    1802.13   <.0001
P047_V      -0.63671             0.01860          145.62656    1172.10   <.0001
P082_A      -0.32763             0.00590          382.93722    3082.13   <.0001
P084_V      -0.54696             0.00708          740.88446    5963.13   <.0001
P090_M      -0.25573             0.00502          322.35883    2594.56   <.0001



Variable P046_L Entered: R-Square = 0.5409 and C(p) = 14326.32

Table 13

Analysis of Variance
Source            DF      Sum of Squares   Mean Square   F Value   Pr > F
Model             7       4460.84208       637.26315     5266.21   <.0001
Error             31283   3785.54767       0.12101
Corrected Total   31290   8246.38974

Variable    Parameter Estimate   Standard Error   Type II SS   F Value   Pr > F
Intercept   0.16438              0.00249          526.67853    4352.36   <.0001
P033_F      -0.54921             0.01056          327.61200    2707.32   <.0001
P046_I      -0.31345             0.00658          274.55948    2268.90   <.0001
P046_L      -0.27769             0.00960          101.30817    837.19    <.0001
P047_V      -0.63940             0.01835          146.85706    1213.60   <.0001
P082_A      -0.26575             0.00620          222.00011    1834.56   <.0001
P084_V      -0.52972             0.00702          689.88420    5701.06   <.0001
P090_M      -0.24267             0.00498          287.89278    2379.09   <.0001



Variable P050_V Entered: R-Square = 0.5534 and C(p) = 13094.82

Table 14

Analysis of Variance
Source            DF      Sum of Squares   Mean Square   F Value   Pr > F
Model             8       4563.24026       570.40503     4844.61   <.0001
Error             31282   3683.14948       0.11774
Corrected Total   31290   8246.38974

Variable    Parameter Estimate   Standard Error   Type II SS   F Value   Pr > F
Intercept   0.16620              [illegible]      538.66106    4575.00   <.0001
P033_F      -0.51815             0.01046          288.64295    2451.52   <.0001
P046_I      -0.29956             0.00651          249.44262    2118.58   <.0001
P046_L      -0.27934             0.00947          102.51249    870.67    <.0001
P047_V      -0.64754             0.01811          150.58308    1278.94   <.0001
P050_V      -0.82797             0.02808          102.39819    869.70    <.0001
P082_A      -0.25944             0.00612          211.32860    1794.87   <.0001
P084_V      -0.53915             0.00693          713.14511    6056.94   <.0001
P090_M      -0.24415             0.00491          291.39382    2474.89   <.0001



Variable P054_M Entered: R-Square = 0.5643 and C(p) = 12005.72

Table 15

Analysis of Variance
Source            DF      Sum of Squares   Mean Square   F Value   Pr > F
Model             9       4653.81644       517.09072     4502.38   <.0001
Error             31281   3592.57330       0.11485
Corrected Total   31290   8246.38974

Variable    Parameter Estimate   Standard Error   Type II SS   F Value   Pr > F
Intercept   0.16633              0.00243          538.92765    4692.51   <.0001
P033_F      -0.46793             0.01049          228.56387    1990.14   <.0001
P046_I      -0.30134             0.00643          252.40077    2197.69   <.0001
P046_L      -0.28508             0.00935          106.71653    929.19    <.0001
P047_V      -0.55613             0.01818          107.50855    936.09    <.0001
P050_V      -0.84757             0.02774          107.23423    933.70    <.0001



WO 2004/111907    PCT/EP2004/051084    -34-



Table 15 (continued)

Variable    Parameter Estimate   Standard Error   Type II SS   F Value   Pr > F
P054_M      -0.57119             0.02034          90.57618     788.66    <.0001
P082_A      -0.26050             0.00605          213.05431    1855.09   <.0001
P084_V      -0.53337             0.00685          697.31870    6071.64   <.0001
P090_M      -0.23454             0.00486          267.57375    2329.80   <.0001



Variable P054_L Entered: R-Square = 0.5740 and C(p) = 11047.41

Table 16

Analysis of Variance
Source            DF      Sum of Squares   Mean Square   F Value   Pr > F
Model             10      4733.53615       473.35362     4214.95   <.0001
Error             31280   3512.85359       0.11230
Corrected Total   31290   8246.38974

Variable    Parameter Estimate   Standard Error   Type II SS   F Value   Pr > F
Intercept   0.16756              0.00240          546.67520    4867.84   <.0001
P033_F      -0.40056             0.01068          158.09630    1407.76   <.0001
P046_I      -0.30750             0.00636          262.46469    2337.10   <.0001
P046_L      -0.28311             0.00925          105.23686    937.08    <.0001
P047_V      -0.52353             0.01802          94.83277     844.43    <.0001
P050_V      -0.87024             0.02744          112.94107    1005.68   <.0001
P054_L      -0.45611             0.01712          79.71971     709.86    <.0001
P054_M      -0.61432             0.02018          104.09593    926.92    <.0001
P082_A      -0.26465             0.00598          219.74203    1956.68   <.0001
P084_V      -0.51838             0.00679          654.14626    5824.81   <.0001
P090_M      -0.22560             0.00482          246.35798    2193.68   <.0001



Variable P088_S Entered: R-Square = 0.5834 and C(p) = 10116.87

Table 17

Analysis of Variance
Source            DF      Sum of Squares   Mean Square   F Value   Pr > F
Model             11      4810.94944       437.35904     3982.07   <.0001
Error             31279   3435.44030       0.10983
Corrected Total   31290   8246.38974

Variable    Parameter Estimate   Standard Error   Type II SS   F Value   Pr > F
Intercept   0.16239              0.00238          510.05266    4643.93   <.0001
P033_F      -0.39933             0.01056          157.11889    1430.54   <.0001
P046_I      -0.32619             0.00633          291.69184    2655.80   <.0001
P046_L      -0.29089             0.00915          110.98928    1010.54   <.0001
P047_V      -0.50941             0.01782          89.70713     816.77    <.0001
P050_V      -0.86163             0.02714          110.70009    1007.90   <.0001
P054_L      -0.46062             0.01693          81.29492     740.17    <.0001
P054_M      -0.61674             0.01995          104.91704    955.25    <.0001
P082_A      -0.25645             0.00592          205.77424    1873.53   <.0001
P084_V      -0.51212             0.00672          637.66342    5805.80   <.0001
P088_S      0.57942              0.02182          77.41329     704.83    <.0001
P090_M      -0.22099             0.00477          236.06351    2149.31   <.0001



Example 3: Dealing with censored values

In this example, the method developed by the inventors to deal with censored values
was applied for the drug amprenavir. Firstly, a linear regression model without
censored values was calculated, see the values 'APV_pFR' for iteration 0 in the Tables
18-21 below. The phenotypic measured value is < -2.083062017 for virus strain 1. For
the first iteration, the value V0 is equal to the measured phenotype value APV_pFR, but
the censor is considered as '='. This was followed by iteration nr. 1. Once the values
were obtained from iteration 1, the inventors compared the prediction value P, e.g. for
virus strain 1, -2.362076895, with the phenotypic measured value < -2.083062017. For
virus strains 1, 2 and 3 below, the measured values have a censor '<'. For virus strain
4, the measured value has a censor '>'. The scenarios established by the inventors
when the case was '<'-censor, and when the case was '>'-censor, were applied. New
linear regression models were calculated and, for the censored values in the linear
regression model, either the data points were removed from the training set, as for
virus strains 1 and 2 and for virus strain 4 (iterations 2 and 3 only), or substitute
values were used instead of the censored phenotype measurements, as illustrated for
virus strain 3 and for virus strain 4 (iteration 1 only). The procedure was reiterated
until the prediction converged.

Virus strain 1: CASE '<' and P < V - 0.798 σ
V - 0.798 σ = -2.282562017

Table 18

ITER   APV_censor   APV_pFR        Prediction P   value
0      <            -2.083062017                  -2.083062017
1      <            -2.083062017   -2.362076895
2      <            -2.083062017   -2.537358188
3      <            -2.083062017   -2.550836421

Virus strain 2: CASE '<' and P < V - 0.798 σ
V - 0.798 σ = -2.108607374

Table 19

ITER   APV_censor   APV_pFR        Prediction P   value
0      <            -1.909107374                  -1.909107374
1      <            -1.909107374   -2.608207782
2      <            -1.909107374   -2.743936505
3      <            -1.909107374   -2.748739259

Virus strain 3: CASE '<' and V <= P
V - 0.798 σ = -2.156216334

Table 20

ITER   APV_censor   APV_pFR        Prediction P   value
0      <            -1.956716334                  -1.956716334
1      <            -1.956716334   -1.343253355   -2.016673816
2      <            -1.956716334   -1.401494738   -2.021216328
3      <            -1.956716334   -1.405764249   -2.02157333

Virus strain 4: CASE '>' and V < P <= V + 0.798 σ
V + 0.798 σ = 0.714928642

Table 21

ITER   APV_censor   APV_pFR       Prediction P   value
0      >            0.515428642                  0.515428642
1      >            0.515428642   0.671585565    0.714928642
2      >            0.515428642   0.759236384
3      >            0.515428642   0.770724812
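
The thresholds quoted above Tables 18-21 can be reproduced with a short sketch. The
value of σ is not stated in this example; `SIGMA = 0.25` is an assumption inferred from
the difference between V and the quoted thresholds. The function names and the case
labels are hypothetical, and the sketch only names the case that applies. It does not
compute the substitute values shown for virus strain 3.

```python
CENSOR_FACTOR = 0.798  # factor quoted in the headings of Tables 18-21
SIGMA = 0.25           # assumption: inferred from the quoted thresholds


def censor_threshold(measured, censor, sigma=SIGMA):
    """Return V - 0.798*sigma for a '<' censor and V + 0.798*sigma for '>'."""
    if censor == "<":
        return measured - CENSOR_FACTOR * sigma
    if censor == ">":
        return measured + CENSOR_FACTOR * sigma
    raise ValueError("censor must be '<' or '>'")


def case_label(measured, censor, predicted, sigma=SIGMA):
    """Name which of the cases illustrated in Tables 18-21 applies."""
    threshold = censor_threshold(measured, censor, sigma)
    if censor == "<":
        if predicted < threshold:
            return "P < V - 0.798 sigma"       # strains 1 and 2: point removed
        if predicted >= measured:
            return "V <= P"                    # strain 3: substitute value used
    else:
        if predicted > threshold:
            return "P > V + 0.798 sigma"       # strain 4, iterations 2 and 3
        if measured < predicted:
            return "V < P <= V + 0.798 sigma"  # strain 4, iteration 1
    return "not illustrated in this example"


# Virus strain 1, iteration 1 (values from Table 18).
print(censor_threshold(-2.083062017, "<"))
print(case_label(-2.083062017, "<", -2.362076895))
```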





Example 4: Mutation trajectories

A methodology for removing unwanted correlations involved an algorithm developed
by the inventors in which the change in pFR was tracked as the effects of individual
mutations or combinations of mutations were removed from the dataset. The effect of
each mutation or combination of mutations was separated out. The methodology
followed mutation trajectories towards the global average as the effects of individual
mutations or combinations of mutations were removed.

In a first step, the average pFR was calculated for all mutations with a sufficient count
in the database to be significant, i.e. >20. In the Table below the first and last 10
mutations of a list of 368 mutations are shown with the corresponding calculated average
pFR.

Table 22

Label         Count         Average pFR
54M           239           -1.35421631044816
50V           129           -1.28868671545329
47V           281           -1.27082426054035
84A           21            -1.23536347179756
76V           187           -1.20328089469875
89V           269           -1.19273200093637
91S           36            -1.15103609689224
33M           22            -1.12268052165089
84C           21            -1.11529918764429
54L           327           -1.09959239280592

18L           65            0.12143300387582
[illegible]   [illegible]   0.121561573218861
[illegible]   [illegible]   0.134344491996282
33V           406           [illegible]
17E           86            0.154734960149477
63I           20            0.159585638802379
62M           22            0.184594303565984
19S           21            0.202798973752203
15L           51            0.213932584381636
88S           204           0.494820688128991



The extremes were determined, and the mutation with the pFR furthest away from the
global average was selected. In Table 22, the mutation selected was 54M.
Following this, all virus strains that had the selected mutation were removed; in total
this amounted to 283 virus strains: from 26738 to 26455. A new Table 23 of average pFR
was generated for the remaining mutations. The list below shows the first and last 10
mutations of the obtained results.

Table 23

New table of average per mutation:

Label   Count   Average pFR
50V     129     -1.28868671545329
84A     21      -1.23536347179756
76V     168     -1.13676536523909
47V     215     -1.12849822159876
91S     35      -1.12648669210251
33M     22      -1.12268052165089
84C     21      -1.11529918764429
54L     327     -1.09959239280592
89V     213     -1.05990043312845
33F     852     -0.988257531387902

18L     65      0.12143300387582
69K     1384    0.125380073379929
89M     1256    0.141136938991525
33V     404     0.147380396015872
17E     86      0.154734960149477
63I     20      0.159585638802379
62M     22      0.184594303565984
19S     21      0.202798973752203
15L     51      0.213932584381636
88S     202     0.494762597743788



The previous steps were reiterated. Table 24 lists the mutations which were selected at
different iteration counts.

Table 24

Count         Mutation selected
26738         54M
26455         50V
26296         84A
26272         91S
26220         47V
25947         76V
25762         54L
25365         22V
25260         33F
24545         32I
24183         84V
21784         54S
21728         82F
21577         54T
21427         24F
21397         24I
20894         73T
20702         73C
20604         55R
20324         95F
20194         10R
20122         54A
20062         82C
20039         88S
19818         46L
[illegible]   23I
[illegible]   58E
[illegible]   67F
[illegible]   73S
18105         [illegible]
17920         10F
17568         82A
16637         46I
16215         54V
16028         35G
15996         90M



The procedure was stopped when the last selected mutation had an average pFR that
approximated to the global average. In Figure 17, a histogram of the population left
after removing all virus strains during the iterations is shown. In Figure 18 the
trajectory of mutation 18H, which has no significant phenotype, demonstrates the
underlying cause: a virus strain carrying 18H is resistant due to mutations other than
18H (54M, 76V, 33F, 84V).
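
The steps of this example can be sketched as follows. This is a minimal illustration
under stated assumptions, not the inventors' implementation: the function name, the
representation of a sample as a pair (set of mutation labels, pFR), the stopping
tolerance `tol` and the toy data are all hypothetical. The count criterion follows the
text (a count greater than `min_count`).

```python
from collections import defaultdict


def mutation_trajectories(samples, min_count=20, tol=0.1):
    """Iteratively select the mutation whose average pFR is furthest from the
    global average and remove all samples that carry it.

    Returns the list of (samples left before removal, mutation, average pFR)
    and the samples that remain at the end.
    """
    remaining = list(samples)
    selected = []
    while remaining:
        global_avg = sum(pfr for _, pfr in remaining) / len(remaining)
        sums, counts = defaultdict(float), defaultdict(int)
        for mutations, pfr in remaining:
            for label in mutations:
                sums[label] += pfr
                counts[label] += 1
        # Average pFR per mutation, keeping only sufficiently frequent ones.
        averages = {m: sums[m] / counts[m] for m in counts if counts[m] > min_count}
        if not averages:
            break
        best = max(averages, key=lambda m: abs(averages[m] - global_avg))
        if abs(averages[best] - global_avg) <= tol:
            break  # the extreme now approximates the global average
        selected.append((len(remaining), best, averages[best]))
        remaining = [(muts, pfr) for muts, pfr in remaining if best not in muts]
    return selected, remaining


# Toy data: 18H has no effect of its own but co-occurs with 54M and 84V.
samples = [
    ({"54M", "18H"}, -1.4), ({"54M"}, -1.3),
    ({"84V", "18H"}, -0.8), ({"84V"}, -0.9),
    ({"18H"}, 0.0), (set(), 0.0), (set(), 0.1), (set(), -0.1),
]
selected, remaining = mutation_trajectories(samples, min_count=1, tol=0.1)
for count, mutation, average in selected:
    print(count, mutation, round(average, 2))
```

In the toy data the average pFR of the samples carrying 18H moves towards the global
average as the carriers of 54M and 84V are removed, which mirrors the behaviour
described for Figure 18.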



Example 5: Highly correlated mutations

A methodology for removing unwanted correlations proceeded as follows. In a first
stage, the correlation coefficient between all mutations (with a sufficient count in the
database, i.e. >20) and the pFR was calculated. Table 25 below lists the first and last
10 coefficients of a list of 202 coefficients that were calculated.
Table 25

NAME     correlation with APV pFR
P084_V   -0.51430323209603
P090_M   -0.510166810424169
P010_I   -0.43843660921923
P046_I   -0.430221405121849
P071_V   -0.400337497273138
P033_F   -0.385947043862637
P082_A   -0.343946685713202
P054_V   -0.339269018202483
P032_I   -0.275105400287351
P010_F   -0.263412991920486

P065_D   0.044821230615045
P012_A   0.0476139542192145
P014_R   0.0485859961407927
P033_V   0.0571605219086003
P041_K   0.0713497764580897
P063_S   0.071416853283617
P030_N   0.080199177466301
P089_M   0.0904859904655573
P069_K   0.0926109439533826
P088_S   0.101920449277774



Consequently, the extremes were determined (maximum, minimum), and the mutation
with the highest absolute value of the correlation coefficient was selected. In this
case it was P084_V.

In the following step, a linear model for the pFR with the selected mutation(s) (from
step 2, all previous iterations) was calculated. The predicted model obtained was
Predicted pFR = -0.844 * M84

Following this, the residue was taken (pFR minus the predicted value from the model). In
Figure 19 a graph of the residues as a function of the measured values is shown. In
Figure 20, the same graph is represented in the form of histograms where the
distribution of the residue may be observed.

Then, the correlation coefficient between all mutations with a sufficient count in the
database and the residue was calculated. Results for the first and last 10 variables are
shown below. It will be observed that the order of the mutations had changed
because the influence of mutation P084_V had been removed.
Table 26

NAME     Correlation with Residue
P082_A   -0.402088216120049
P090_M   -0.373018037350597
P033_F   -0.347337204147266
P032_I   -0.325689507827784
P046_I   -0.324078296352946
P010_I   -0.321343698855709
P054_V   -0.311246000758214
P071_V   -0.290388889036061
P047_V   -0.272571613897309
P046_L   -0.234148912637279

P012_S   0.0358242687063381
P014_R   0.0401577992704191
P012_A   0.0442436406740132
P063_S   0.0484345347109348
P033_V   0.0537012592855092
P030_N   0.0548222080117112
P041_K   0.0659977808671604
P089_M   0.078693680117922
P069_K   0.0835173137603877
P088_S   0.110781153455097



Consequently, the extremes were determined again, and the mutation with the highest
absolute value of the correlation coefficient was selected. The mutation selected now
was P082_A.

A new linear model for the pFR with the selected mutations, now including P082_A, was
calculated:
Predicted pFR = -0.782 * M84 - 0.435 * M82

After 6 iterations, the following resistance coefficients for a selected group of
mutations were obtained.

Table 27

Parameter   β
P084_V      -0.512345233450494
P082_A      -0.247464800298171
P090_M      -0.156011849225004
P033_F      -0.532092050020052
P046_I      -0.205719880528639
P047_V      -0.656374746817602

In Figure 21, a graph is represented in the form of histograms showing the residues
after the 6 parameters were taken into account.

The reiterative procedure continued until the last selected mutation had a correlation
coefficient that approximated to zero.
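
The iteration described in this example can be sketched in a few lines. This is a
minimal illustration under stated assumptions, not the inventors' implementation: the
model is fitted without an intercept (as in the formulas printed above), the stopping
`threshold` on the correlation coefficient is a hypothetical value, and the function
names and toy data are invented for the illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation; returns 0.0 for a (numerically) constant input."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx < 1e-12 or syy < 1e-12:
        return 0.0
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / math.sqrt(sxx * syy)


def solve(a, b):
    """Solve a small linear system by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit(columns, y):
    """Least-squares coefficients (no intercept) via the normal equations."""
    k = len(columns)
    xtx = [[sum(p * q for p, q in zip(columns[i], columns[j])) for j in range(k)]
           for i in range(k)]
    xty = [sum(p * v for p, v in zip(columns[i], y)) for i in range(k)]
    return solve(xtx, xty)


def select_by_residual_correlation(features, y, threshold=0.05):
    """Add the mutation most correlated with the residue, refit, and repeat."""
    selected, coefs, residual = [], [], list(y)
    while len(selected) < len(features):
        corr = {name: pearson(col, residual)
                for name, col in features.items() if name not in selected}
        best = max(corr, key=lambda name: abs(corr[name]))
        if abs(corr[best]) < threshold:
            break  # remaining correlations approximate zero
        selected.append(best)
        cols = [features[name] for name in selected]
        coefs = fit(cols, y)  # refit with all mutations selected so far
        residual = [v - sum(c * col[i] for c, col in zip(coefs, cols))
                    for i, v in enumerate(y)]
    return dict(zip(selected, coefs))


# Toy data: pFR = -0.8*M84 - 0.4*M82, with M63 unrelated to the phenotype.
features = {
    "M84": [1, 1, 1, 1, 0, 0, 0, 0],
    "M82": [1, 1, 0, 0, 1, 1, 0, 0],
    "M63": [1, 0, 1, 0, 1, 0, 1, 0],
}
pfr = [-0.8 * a - 0.4 * b for a, b in zip(features["M84"], features["M82"])]
model = select_by_residual_correlation(features, pfr)
print(model)
```

In the toy data the coefficient of M84 is first estimated with M84 alone and is then
adjusted once M82 enters the model, in the same way as the coefficient of M84 changes
from -0.844 to -0.782 above.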



Example 6: Results from linear regression modeling

In an initial test, genotypes and corresponding phenotypes determined for ritonavir
(RTV) for 28,540 HIV-1 clinical isolates were used. The linear regression analysis
identified 20/22 RTV resistance-associated mutations described in the IAS mutation list
(except 10F and 77I) (see Figure 13). Additional mutations whose effect on RTV
susceptibility had been previously described (e.g. 73S/T/C, 84A/C and 88D) were also
identified (Figure 14). Overall, 53 single mutations and 96 pairs of mutations were
identified as having a significant effect on susceptibility to RTV.

The predicted phenotype was compared to the measured phenotype in a leave-one-out
cross-validation, demonstrating a root mean square error of 0.31 (log FR) (see Figure
15). The error rate of the linear modeling method [5.62% (sensitivity = 93.0%,
specificity = 95.4%)] compared favourably to a decision tree-based model [Beerenwinkel,
PNAS 99 (2002) 8271-8276] [10.2% (sensitivity = 89.8%, specificity = 89.7%)] (see
Figure 16). The robustness of the algorithm as a function of the size of the input
dataset was assessed using smaller subsets of data. Nine of 22 IAS resistance-associated
mutations for RTV could be identified with subsets > 5% (1600 isolates) of the original
data. However, the accuracy of the predicted contribution of the mutations improved with
increasing dataset sizes up to 50% of the original database (median standard error of
the predicted contributions decreased 50%). Some secondary mutations (e.g. 10R, 32I,
82S) were identified as having a significant contribution to resistance only when the
subset size reached a similar 50% level.



Example 7: Comparison of genotype-to-phenotype prediction for different
Artificial Intelligence techniques

Analyses were performed on matching genotype/phenotype datasets for all 16 HIV
inhibitors currently available. The matching genotype/phenotype datasets consisted of
approximately 30,000 data points for most drugs except Lopinavir and Tenofovir.
As an example, the following results for Ritonavir were obtained:
• A log(FR) root mean squared error of 0.31 (MSE = 0.096).
• A classification error of 5.6% (with regard to a standard cutoff applied by the
inventors of FR = 3.5) (sensitivity = 93.0%, specificity = 95.4%).
• Regression model identified 53 single mutations and 96 pairs of mutations. 20
of 22 mutations of the IAS list are confirmed in this model.
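
The figures in the list above are standard evaluation metrics. The sketch below shows
how they could be computed from paired measured and predicted FR values; the function
name and the sample values are hypothetical, and calling a sample resistant when its FR
is strictly greater than the cutoff is an assumption.

```python
import math


def evaluate_predictions(measured_fr, predicted_fr, cutoff=3.5):
    """Compare predicted and measured fold resistance (FR) values."""
    pairs = list(zip(measured_fr, predicted_fr))
    # Mean squared error on the log10(FR) scale.
    mse = sum((math.log10(p) - math.log10(m)) ** 2 for m, p in pairs) / len(pairs)
    # Confusion counts; "positive" means resistant (FR above the cutoff).
    tp = sum(1 for m, p in pairs if m > cutoff and p > cutoff)
    fn = sum(1 for m, p in pairs if m > cutoff and p <= cutoff)
    tn = sum(1 for m, p in pairs if m <= cutoff and p <= cutoff)
    fp = sum(1 for m, p in pairs if m <= cutoff and p > cutoff)
    return {
        "mse": mse,
        "rmse": math.sqrt(mse),
        "classification_error": (fp + fn) / len(pairs),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Hypothetical FR values, for illustration only.
measured = [1.0, 2.0, 10.0, 50.0, 4.0]
predicted = [1.0, 5.0, 10.0, 40.0, 3.0]
result = evaluate_predictions(measured, predicted)
print(result)
```

The two error figures quoted for Ritonavir are consistent with each other, since the
square of the root mean squared error 0.31 is approximately 0.096.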

Table 28

Technique                                            [illegible]   CV-MSE (*)            Classification   Nr of samples       Coverage   Model inspection
                                                                   (in log(FR))          error                                           possible
Neural network (Lopinavir)                           0.88          n/a                   n/a              1,322               100%       N
Support Vector Machine (Ritonavir)                   0.79          0.176                 n/a              652                 100%       N
Support Vector Machines (Tibotec data, Ritonavir)    0.81          0.144                 n/a              17,453              100%       N
Decision tree (Ritonavir)                                          Classification only   10.2%            469                 100%       Y
Clustering (Indinavir)                               n/a           n/a                   15.3%            1,152               100%       N
Self Organizing Map (Saquinavir)                                   Classification only   15%              811 (38 matching)   84%        N
Linear regression modeling (Virco data, Ritonavir)   0.88          0.096                 5.6%             34,502              100%       Y

(*) CV-MSE: cross-validation mean squared error. Depending on the analysis, it is a
leave-one-out or a 10-fold cross-validation.

[1] A 28-Mutation Neural Network Model That Accurately Predicts Phenotypic
Resistance to Lopinavir (LPV)
D Wang, R Harrigan and BA Larder, Antiviral Therapy 2001 (Supplement 1): 105
[2] Predicting HIV Drug Resistance With Neural Networks
S Draghici, R Potter, Bioinformatics, Vol. 19 no. 1, 2003 (p. 98-107).
[3] Geno2pheno: Interpreting Genotypic HIV Drug Resistance Tests
N Beerenwinkel et al., Intelligent Systems in Biology, Nov/Dec 2001.
[4] Geno2pheno: estimating phenotypic drug resistance from HIV-1 genotypes
N Beerenwinkel et al., Nucleic Acids Research, 2003, Vol. 31, No 13
[5] Predicting phenotype from genotype: a comparison of statistical methods
A Foulkes et al.

Neural networks, support vector machines and clustering techniques are not suitable
candidates for a quantitative prediction system with the requirement that such a system
should have a descriptive power in order to be inspectable by experts and to be able to
motivate certain predictions to customers. Decision trees allow experts to have insight
into how the decision process works. But decision trees require massive amounts of data
to build complex trees, since the information in the data is not optimally used.
Based on the performed analyses and the linear regression modelling feasibility study,
linear regression modelling seems to meet the requirements for a quantitative prediction
system. Linear regression seems to outperform the other examined techniques and offers
the possibility of inspection to HIV experts. Considering the lower error rates, the
linear regression modelling allows the optimisation of drug therapy in patients.

Example 8: Comparison of the regression linear model with a rules-based
algorithm

The regression linear model developed by the inventors was used to test the drug-
associated resistance of one HIV-1 sample to nevirapine, delavirdine, and efavirenz.
In parallel, the same sample was run in a rules-based algorithm methodology as
described in WO 01/79540.

The results were expressed in FR, the fold change in IC50 or EC50 relative to reference
wild-type virus. The IC50 is the drug concentration at which 50% of the enzyme activity
is inhibited, and is expressed in µM units. The FR cutoff values for normal susceptible
ranges were 8, 10, and 6, for nevirapine, delavirdine, and efavirenz, respectively.
The phenotypic antiviral experiment was taken as the gold standard by the inventors.
Said phenotypic antiviral experiment was performed as described in WO 97/27480.
Results of the three methodologies are enclosed in Table 29 below.
According to the results obtained by the rules-based algorithm, the patient's sample
would be susceptible to all three drugs. However, when the results obtained by the
linear regression model were considered, the sample showed resistance against
nevirapine and efavirenz and susceptibility against delavirdine. These last results were
confirmed by the phenotypic antiviral experiments.

Table 29

drug          Rules-based algorithm        Linear regression   Phenotypic antiviral
                                           model               experiment
              Matches in database   FR     FR                  FR
nevirapine    321                   3.6    176.1               >89
delavirdine   301                   1.9    3.5                 3.7
efavirenz     300                   1.9    12.9                17.5
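
The comparison in Table 29 can be checked against the FR cutoffs quoted in this
example. In the sketch below the names are hypothetical, the censored phenotype value
">89" is represented by its lower bound 89, and calling a sample resistant when its FR
is strictly greater than the cutoff is an assumption.

```python
# FR cutoffs for the normal susceptible range, as quoted in this example.
CUTOFFS = {"nevirapine": 8, "delavirdine": 10, "efavirenz": 6}

# FR values from Table 29; the phenotype for nevirapine is reported as >89.
TABLE_29 = {
    "nevirapine": {"rules": 3.6, "regression": 176.1, "phenotype": 89.0},
    "delavirdine": {"rules": 1.9, "regression": 3.5, "phenotype": 3.7},
    "efavirenz": {"rules": 1.9, "regression": 12.9, "phenotype": 17.5},
}


def call(drug, fold_resistance):
    """Return 'resistant' when FR exceeds the drug's cutoff, else 'susceptible'."""
    return "resistant" if fold_resistance > CUTOFFS[drug] else "susceptible"


calls = {
    drug: {method: call(drug, fr) for method, fr in values.items()}
    for drug, values in TABLE_29.items()
}
for drug, by_method in calls.items():
    print(drug, by_method)
```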



